This is the mail archive of the
fortran@gcc.gnu.org
mailing list for the GNU Fortran project.
[gomp] omp performance question
- From: "Daniel Franke" <franke dot daniel at gmail dot com>
- To: fortran at gcc dot gnu dot org
- Date: Mon, 30 Oct 2006 12:09:21 +0100
- Subject: [gomp] omp performance question
Hi all,
hopefully this is not off-topic; if it is, please let me know where
else to ask, thanks.
I am toying with the OpenMP implementation available in gfortran-4.2
(prerelease). After carefully profiling my program (gprof,
valgrind/callgrind), I identified two sections of code where approx
95% of execution time is spent, 47.x% each. Both sections have nested
DO loops similar to:
    sum(:) = 0.0
    DO l = 0, lmax
      tmp(:) = 0.0
      DO m = 0, l
        tmp(:) = tmp(:) + ...
      END DO
      sum(:) = sum(:) + tmp(:) + ...
    END DO
Therefore, I concluded that OMP PARALLEL DO could improve matters, since
appropriate SMP hardware is available. Counterintuitively, it did not.
Single-threaded timings (FCFLAGS=-O1) on x86_64, dual CPU (dual core
each), gave:
real 64m36.502s
user 64m36.886s
sys 0m0.040s
Same machine, OpenMP enabled (FCFLAGS="-O1 -fopenmp"):
real 67m16.611s
user 112m22.885s
sys 25m44.641s
Due to an ICE in the Intel Fortran compiler (see [1-3]), I have no
means to compare these timings. Could someone with more experience
with the GNU OpenMP implementation comment on the actual code snippet
given below [4]?
Daniel
[1] http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29629
[2] http://www.openmp.org/pipermail/omp/2006/000551.html
[3] http://www.openmp.org/pipermail/omp/2006/000552.html
[4] The actual computation, OpenMP statements included. Besides
CONJG(), no functions or subroutines are called; everything is
precomputed and stored within arrays. This function is easily called
20,000,000 times during an annealing procedure.
    FUNCTION intensity(sa, s)
      USE math, ONLY: PI
      USE dammin_dam, ONLY: dam
      TYPE(simulated_annealing), INTENT(in) :: sa
      REAL(DBL), DIMENSION(:), INTENT(in) :: s
      REAL(DBL), DIMENSION(size(s)) :: intensity
      COMPLEX(DBL), DIMENSION(size(s)) :: Alm, Al0
      REAL(DBL), DIMENSION(size(s)) :: sumAlm
      INTEGER :: l, m

      intensity = 0.0
      !$OMP PARALLEL DO PRIVATE(l, m, Al0, Alm, sumAlm), REDUCTION(+:intensity)
      DO l = 0, sa%max_harmonics ! sa%max_harmonics ~ 10 to 20
        ! accumulate the m = 1..l terms for this l
        sumAlm = 0.0
        DO m = 1, l
          Alm = sa%current%alm(l, m, :)
          sumAlm = sumAlm + 2.0 * Alm * CONJG(Alm)
        END DO
        ! add the m = 0 term and fold this l into the result
        Al0 = sa%current%alm(l, 0, :)
        intensity = intensity + Al0 * CONJG(Al0) + sumAlm
      END DO
      !$OMP END PARALLEL DO
      intensity = 2.0 * PI**2 * intensity
    END FUNCTION
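
For anyone who wants to reproduce the timings without the rest of the
program, here is a rough stand-alone sketch of the same loop pattern. It is
not the actual code: the program name, the array sizes (lmax, ns), the
repeat count (ncalls) and the constant placeholder data are all made up for
illustration, and DBL is assumed to be double precision.

```fortran
! Hypothetical stand-alone harness; all names and sizes are assumptions.
PROGRAM omp_reduction_sketch
  IMPLICIT NONE
  INTEGER, PARAMETER :: DBL = KIND(1.0D0)  ! assumed meaning of DBL
  INTEGER, PARAMETER :: lmax = 15          ! assumed, within the 10 to 20 range
  INTEGER, PARAMETER :: ns = 64            ! assumed size(s)
  INTEGER, PARAMETER :: ncalls = 100000    ! assumed number of repeated calls
  COMPLEX(DBL), ALLOCATABLE :: alm(:,:,:)
  REAL(DBL) :: total(ns), tmp(ns)
  INTEGER :: l, m, i, c0, c1, rate

  ALLOCATE(alm(0:lmax, 0:lmax, ns))
  alm = CMPLX(1.0_DBL, 0.5_DBL, KIND=DBL)  ! placeholder data, not real coefficients

  CALL SYSTEM_CLOCK(c0, rate)
  DO i = 1, ncalls
    ! one "call" of the kernel: a parallel region is entered each time
    total = 0.0_DBL
    !$OMP PARALLEL DO PRIVATE(m, tmp) REDUCTION(+:total)
    DO l = 0, lmax
      tmp = 0.0_DBL
      DO m = 1, l
        tmp = tmp + 2.0_DBL * REAL(alm(l, m, :) * CONJG(alm(l, m, :)), DBL)
      END DO
      total = total + REAL(alm(l, 0, :) * CONJG(alm(l, 0, :)), DBL) + tmp
    END DO
    !$OMP END PARALLEL DO
  END DO
  CALL SYSTEM_CLOCK(c1)

  PRINT *, 'wall seconds:', REAL(c1 - c0, DBL) / REAL(rate, DBL)
  PRINT *, 'total(1)    :', total(1)
END PROGRAM omp_reduction_sketch
```

Building this once with -O1 and once with "-O1 -fopenmp", and varying
OMP_NUM_THREADS for the second build, should give numbers comparable in kind
to the ones above, though of course not the same values.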