Created attachment 36716 [details]
gzip tar file of Fortran and C source files

Just recently, it has become necessary to add the omp simd safelen(1) directive in subroutine s111 in order to prevent a vectorization which reduces performance on all known IA targets other than Intel Xeon Phi. The same situation occurs with gcc/g++ and (for several years) icc/icpc, but not ifort.

make -j 3 -f Makefile.cygwin lcd_ffast

I haven't tested the latest gfortran build on Linux, but I do have a Makefile for that, in case it's useful. In the Makefile, CLOCK_RATE is set to enable accurate translation from rdtsc ticks to seconds. The timings for VL=100 and VL=1000 show the reduced performance of s111 when it is vectorized, i.e. when safelen(1) is removed. For gcc and g++, functions s128() and s4113() also need vectorization disabled for full performance, but gfortran doesn't exhibit that problem. For this filing, you can ignore everything but subroutine s111.
make: *** No rule to make target 'lcdmod.o', needed by 'lcd_mod.mod'.  Stop.

or

Fatal Error: Can't open module file 'lcd_mod.mod' for reading at (1): No such file or directory

What should be done on non-Cygwin platforms?
Created attachment 36722 [details]
fortran source

gfortran -c lcdmod.f90 should take care of the missing .mod file.
Hmm, there are many loops here. I looked at the following (assuming the interesting loops are the ones marked with safelen(1)):

      subroutine s111(ntimes,ld,n,ctime,dtime,a,b,c,d,e,aa,bb,cc)
      use lcd_mod
C
C     linear dependence testing
C     no dependence - vectorizable
C
      integer ntimes,ld,n,i,nl
      real a(n),b(n),c(n),d(n),e(n),aa(ld,n),bb(ld,n),cc(ld,n)
      real t1,t2,chksum,ctime,dtime,cs1d
      call init(ld,n,a,b,c,d,e,aa,bb,cc,'s111 ')
      call forttime(t1)
      do nl= 1,2*ntimes
#ifndef __MIC__
!$omp simd safelen(1)
#endif
      do i= 2,n,2
        a(i)= a(i-1)+b(i)
      enddo
      call dummy(ld,n,a,b,c,d,e,aa,bb,cc,1.)
      enddo
      call forttime(t2)

Current trunk doesn't consider this profitable unless -mavx is given (it seems to need the larger vector size for profitability).

Because of the step of 2 it ends up using strided stores. Instead of doing interleaving on the loads and stores we could have just operated on all elements (rather than only the even ones) and then used a masked store. That would waste half of the vector bandwidth but save all the shuffles.
.L8:
        vmovups (%rdx), %xmm0
        addl    $1, %r9d
        addq    $64, %rdx
        addq    $64, %r11
        vmovups -32(%rdx), %xmm2
        vinsertf128     $0x1, -48(%rdx), %ymm0, %ymm1
        vmovups -64(%r11), %xmm9
        vinsertf128     $0x1, -16(%rdx), %ymm2, %ymm3
        vmovups -32(%r11), %xmm11
        vinsertf128     $0x1, -48(%r11), %ymm9, %ymm10
        vinsertf128     $0x1, -16(%r11), %ymm11, %ymm12
        vshufps $136, %ymm3, %ymm1, %ymm4
        vshufps $136, %ymm12, %ymm10, %ymm13
        vperm2f128      $3, %ymm4, %ymm4, %ymm5
        vperm2f128      $3, %ymm13, %ymm13, %ymm14
        vshufps $68, %ymm5, %ymm4, %ymm6
        vshufps $238, %ymm5, %ymm4, %ymm7
        vshufps $68, %ymm14, %ymm13, %ymm15
        vshufps $238, %ymm14, %ymm13, %ymm0
        vinsertf128     $1, %xmm7, %ymm6, %ymm8
        vinsertf128     $1, %xmm0, %ymm15, %ymm1
        vaddps  %ymm1, %ymm8, %ymm2
        vextractf128    $0x1, %ymm2, %xmm4
        vmovss  %xmm2, -60(%rdx)
        vextractps      $1, %xmm2, -52(%rdx)
        vextractps      $2, %xmm2, -44(%rdx)
        vextractps      $3, %xmm2, -36(%rdx)
        vmovss  %xmm4, -28(%rdx)
        vextractps      $1, %xmm4, -20(%rdx)
        vextractps      $2, %xmm4, -12(%rdx)
        vextractps      $3, %xmm4, -4(%rdx)
        cmpl    %r9d, %ecx
        ja      .L8

What we fail to realize here is that cross-lane interleaving doesn't work within AVX 256-bit vectors, and thus the interleave for the loads is much more expensive than we think.

That's a general vectorizer cost model issue:

  /* Uses an even and odd extract operations or shuffle operations
     for each needed permute.  */
  int nstmts = ncopies * ceil_log2 (group_size) * group_size;
  inside_cost = record_stmt_cost (body_cost_vec, nstmts, vec_perm,
                                  stmt_info, 0, vect_body);

which 1) doesn't consider single-element interleaving differently, and 2) simply uses the vec_perm cost, which in reality depends heavily on the actual (constant) permutation used.
On 11/16/2015 7:13 AM, Richard Biener (rguenth at gcc dot gnu.org) wrote in comment #3:
Thanks for the interesting analysis. icc/icpc take safelen(1) as preventing vectorization in this case, but I found another stride-2 case where they still perform the unprofitable AVX vectorization. Maybe I'll submit an Intel PR (IPS).
Sorry, will attach that source file.