[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644
rguenth at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Mon Jan 29 10:54:00 GMT 2018
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037
--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Martin Liška from comment #7)
> (In reply to Jakub Jelinek from comment #6)
> > Is it really r256643 and not r256644 that is causing this though?
>
> Yes, I can verify that it's r256644 that's causing the regression.
This means that those newly vectorized loops make capacita slower. Line 551 is
  do m=1,n/4-1
    h = A(j0+m) + A(j2+m)*E(m*inc)
    A(j2+m) = A(j0+m) - A(j2+m)*E(m*inc)
    A(j0+m) = h
    eh = conjg(E(ntot/4-m*inc))
    h = A(j1+m) - A(j3+m)*eh
    A(j3+m) = A(j1+m) + A(j3+m)*eh
    A(j1+m) = h
  end do
but I guess it's actually the loops from the array expressions (which receive
"interesting" locations). Line 105 is
     do i=1,Ng1 ! .. and multiply charge with x
->     do j=1,Ng2
         X(i,j) = X(i,j) * D1 * (i-(Ng1+1)/2.0_dp)
       end do
     end do
The variable strides are there because those arrays are accessed via array
descriptors. This means that the stride will very likely be one, so
any "strided XY" vectorization will have quite a big overhead.
Of course, in the end it looks like a cost model issue (which might
just be not factoring in enough of the alias runtime check). In
the 105 case, for example, I expect the non-vectorized loop to be quite a
"nice" optimized nest with IVO doing a good job, etc. If the inner loop
is vectorized conditionally, this can wreck code generation and
runtime quite a bit. Ng1/Ng2 are 1024 (at runtime, read from capacita.in).
For 105 we have
capacita.f90:105:0: note: need run-time check that (ssizetype) ((sizetype) prephitmp_341 * 4) is nonzero
capacita.f90:105:0: note: versioning for alias required: can't determine dependence between d1 and *_150[_65]
so two run-time checks are needed.
capacita.f90:105:0: note: Cost model analysis:
Vector inside of loop cost: 172
Vector prologue cost: 60
Vector epilogue cost: 136
Scalar iteration cost: 60
Scalar outside cost: 8
Vector outside cost: 196
prologue iterations: 0
epilogue iterations: 2
Calculated minimum iters for profitability: 7
I think the bug is that we somehow think the vectorized arithmetic
(two multiplications) offsets the cost of the strided loads and stores...
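To see where the "7" in the dump could come from, here is a rough sketch of a
break-even calculation over the costs printed above. It is a simplified model
written for illustration, not GCC's source; the function name
min_profitable_iters and its argument names are hypothetical, and the
vectorization factor of 4 is an assumption (the dump does not state it; it is
merely consistent with the 2 epilogue iterations).

```fortran
module cost_sketch
  implicit none
contains
  ! Smallest iteration count at which the vector loop is estimated to win.
  ! Simplified break-even model for illustration; not GCC's actual code.
  pure function min_profitable_iters(vec_inside, vec_outside, scalar_iter, &
                                     scalar_outside, peel_iters, vf) result(n)
    integer, intent(in) :: vec_inside, vec_outside, scalar_iter
    integer, intent(in) :: scalar_outside, peel_iters, vf
    integer :: n, saved_per_vec_iter, extra_outside

    ! Cost saved by one vector iteration relative to vf scalar iterations.
    saved_per_vec_iter = scalar_iter * vf - vec_inside
    if (saved_per_vec_iter <= 0) then
      n = huge(n)   ! the vector body is never cheaper in this model
      return
    end if

    ! Extra setup/teardown cost of the vector version, scaled by vf.
    extra_outside = (vec_outside - scalar_outside) * vf
    ! Integer division truncates here.
    n = (extra_outside - vec_inside * peel_iters) / saved_per_vec_iter
    ! Bump by one if the scalar loop is still no more expensive at n.
    if (scalar_iter * vf * n <= vec_inside * n + extra_outside) n = n + 1
  end function min_profitable_iters
end module cost_sketch

program demo
  use cost_sketch
  implicit none
  ! Costs taken from the dump; peel_iters = 0 prologue + 2 epilogue.
  ! ASSUMPTION: vf = 4, which the dump does not state.
  print *, min_profitable_iters(172, 196, 60, 8, 2, 4)   ! prints 7
end program demo
```

Under that assumption the model prices four scalar iterations at 4*60 = 240
against 172 for one vector iteration, so the vector body looks about 28%
cheaper even though it uses strided loads and stores.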
Testcase for this loop:
module solv_cap
  implicit none
  public :: solveP
  integer, parameter, public :: dp = selected_real_kind(5)
  real(kind=dp), private :: Pi, eps0
  real(kind=dp), private :: D1, D2
  integer, private, save :: Ng1=0, Ng2=0
  integer, private, pointer, dimension(:,:) :: Grid
contains
  subroutine solveP(P)
    real(kind=dp), intent(out) :: P
    real(kind=dp), allocatable, dimension(:,:) :: Y0, X
    integer :: i,j
    allocate( Y0(Ng1,Ng2), X(Ng1,Ng2) )
    do i=1,Ng1
      do j=1,Ng2
        Y0(i,j) = D1 * (i-(Ng1+1)/2.0_dp) * Grid(i,j)
      end do
    end do ! RHS for in-field E_x=1. V = -V_in = x, where metal on grid
    call solve( X, Y0 )
    X = X - sum(X)/size(X) ! get rid of monopole term ..
    do i=1,Ng1 ! .. and multiply charge with x
      do j=1,Ng2
        X(i,j) = X(i,j) * D1 * (i-(Ng1+1)/2.0_dp)
      end do
    end do
    P = sum(X)*D1*D2 * 4*Pi*eps0 ! E-dipole moment in 1 V/m field
    deallocate( X, Y0 )
    return
  end subroutine solveP
end module solv_cap