[Bug target/68365] gfortran test case showing performance loss with vectorization
- From: "rguenth at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Mon, 16 Nov 2015 12:13:51 +0000
- Subject: [Bug target/68365] gfortran test case showing performance loss with vectorization
- Auto-submitted: auto-generated
- References: <bug-68365-4 at http dot gcc dot gnu dot org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68365
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |NEW
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Hmm, there are many loops here. I looked at the following (assuming the
interesting loops are marked with safelen(1))
      subroutine s111(ntimes,ld,n,ctime,dtime,a,b,c,d,e,aa,bb,cc)
      use lcd_mod
C
C     linear dependence testing
C     no dependence - vectorizable
C
      integer ntimes,ld,n,i,nl
      real a(n),b(n),c(n),d(n),e(n),aa(ld,n),bb(ld,n),cc(ld,n)
      real t1,t2,chksum,ctime,dtime,cs1d
      call init(ld,n,a,b,c,d,e,aa,bb,cc,'s111 ')
      call forttime(t1)
      do nl= 1,2*ntimes
#ifndef __MIC__
!$omp simd safelen(1)
#endif
      do i= 2,n,2
        a(i)= a(i-1)+b(i)
      enddo
      call dummy(ld,n,a,b,c,d,e,aa,bb,cc,1.)
      enddo
      call forttime(t2)
and current trunk doesn't consider this profitable unless -mavx is given
(it seems to need the larger vector size for profitability).
Because of the step of 2 it ends up using strided stores. Instead of
doing interleaving on the loads and stores we could have operated
on all elements (rather than only the even ones) and then used a masked
store. That would waste half of the vector bandwidth but save all the
shuffles.
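Roughly, as an (untested, hand-written) intrinsics sketch of that
alternative -- not what GCC emits, and with the scalar tail loop omitted:

#include <immintrin.h>

/* Masked-store variant of the s111 kernel a(i) = a(i-1) + b(i),
   i = 2,4,...,n (0-based: a[i] = a[i-1] + b[i] for odd i).  All eight
   lanes are computed but only the four needed ones are written back,
   so no cross-lane shuffles are required.  */
void s111_masked (float *a, const float *b, int n)
{
  /* Write lanes 0,2,4,6 of each store: a[i], a[i+2], a[i+4], a[i+6].  */
  const __m256i mask = _mm256_setr_epi32 (-1, 0, -1, 0, -1, 0, -1, 0);
  for (int i = 1; i + 8 <= n; i += 8)
    {
      __m256 prev = _mm256_loadu_ps (a + i - 1); /* lane j is a[i+j-1] */
      __m256 bv   = _mm256_loadu_ps (b + i);     /* lane j is b[i+j] */
      __m256 sum  = _mm256_add_ps (prev, bv);    /* compute all lanes */
      _mm256_maskstore_ps (a + i, mask, sum);    /* store even lanes only */
    }
}

What we generate with the interleaving approach instead: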
.L8:
vmovups (%rdx), %xmm0
addl $1, %r9d
addq $64, %rdx
addq $64, %r11
vmovups -32(%rdx), %xmm2
vinsertf128 $0x1, -48(%rdx), %ymm0, %ymm1
vmovups -64(%r11), %xmm9
vinsertf128 $0x1, -16(%rdx), %ymm2, %ymm3
vmovups -32(%r11), %xmm11
vinsertf128 $0x1, -48(%r11), %ymm9, %ymm10
vinsertf128 $0x1, -16(%r11), %ymm11, %ymm12
vshufps $136, %ymm3, %ymm1, %ymm4
vshufps $136, %ymm12, %ymm10, %ymm13
vperm2f128 $3, %ymm4, %ymm4, %ymm5
vperm2f128 $3, %ymm13, %ymm13, %ymm14
vshufps $68, %ymm5, %ymm4, %ymm6
vshufps $238, %ymm5, %ymm4, %ymm7
vshufps $68, %ymm14, %ymm13, %ymm15
vshufps $238, %ymm14, %ymm13, %ymm0
vinsertf128 $1, %xmm7, %ymm6, %ymm8
vinsertf128 $1, %xmm0, %ymm15, %ymm1
vaddps %ymm1, %ymm8, %ymm2
vextractf128 $0x1, %ymm2, %xmm4
vmovss %xmm2, -60(%rdx)
vextractps $1, %xmm2, -52(%rdx)
vextractps $2, %xmm2, -44(%rdx)
vextractps $3, %xmm2, -36(%rdx)
vmovss %xmm4, -28(%rdx)
vextractps $1, %xmm4, -20(%rdx)
vextractps $2, %xmm4, -12(%rdx)
vextractps $3, %xmm4, -4(%rdx)
cmpl %r9d, %ecx
ja .L8
What we fail to realize here is that 256-bit AVX shuffles operate only
within 128-bit lanes, so the cross-lane interleave for the loads needs
extra permutes and is much more expensive than we think.
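Expressed as intrinsics, the even-element extraction in the dump above
(my reconstruction, for illustration) needs five shuffle-class
operations per pair of input vectors:

#include <immintrin.h>

/* Gather elements 0,2,4,...,14 of the 16 floats in lo/hi.  vshufps only
   combines within each 128-bit lane, so two extra cross-lane steps
   (vperm2f128, vinsertf128) are required on top of the in-lane picks.  */
static __m256 extract_even (__m256 lo, __m256 hi)
{
  /* [l0,l2,h0,h2 | l4,l6,h4,h6] -- per-lane even picks (vshufps $136).  */
  __m256 packed  = _mm256_shuffle_ps (lo, hi, 0x88);
  /* Swap the 128-bit halves (vperm2f128 $3).  */
  __m256 swapped = _mm256_permute2f128_ps (packed, packed, 0x03);
  /* [l0,l2,l4,l6 | ...] and [h0,h2,h4,h6 | ...] (vshufps $68 and $238).  */
  __m256 evenlo  = _mm256_shuffle_ps (packed, swapped, 0x44);
  __m256 evenhi  = _mm256_shuffle_ps (packed, swapped, 0xEE);
  /* Combine into [l0,l2,l4,l6,h0,h2,h4,h6] (vinsertf128 $1).  */
  return _mm256_insertf128_ps (evenlo, _mm256_castps256_ps128 (evenhi), 1);
}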
That's a general vectorizer cost model issue:
      /* Uses an even and odd extract operations or shuffle operations
         for each needed permute.  */
      int nstmts = ncopies * ceil_log2 (group_size) * group_size;
      inside_cost = record_stmt_cost (body_cost_vec, nstmts, vec_perm,
                                      stmt_info, 0, vect_body);
which (1) doesn't treat single-element interleaving any differently, and
(2) simply uses the generic vec_perm cost, even though the real cost
depends heavily on the actual (constant) permutation used.
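As a back-of-the-envelope check (my numbers, not measured): for these
stride-2 load groups group_size == 2, so the formula charges
ceil_log2 (2) * 2 == 2 generic vec_perm stmts per vector copy, while
the AVX sequence above spends five shuffle-class instructions per pair
of input vectors (plus the vinsertf128s used to build them):

#include <stdio.h>

/* ceil_log2 as used by the cost formula (smallest l with 2^l >= x).  */
static int ceil_log2 (int x) { int l = 0; while ((1 << l) < x) l++; return l; }

int main (void)
{
  int ncopies = 1, group_size = 2;
  int nstmts = ncopies * ceil_log2 (group_size) * group_size;
  printf ("modelled vec_perm stmts per copy: %d (vs. 5 shuffles in the dump)\n",
          nstmts);
  return 0;
}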