This is the mail archive of the
fortran@gcc.gnu.org
mailing list for the GNU Fortran project.
Performance analysis of Polyhedron/gas_dyn
- From: Janne Blomqvist <blomqvist dot janne at gmail dot com>
- To: gfortran <fortran at gcc dot gnu dot org>
- Date: Fri, 27 Apr 2007 10:38:57 +0300
- Subject: Performance analysis of Polyhedron/gas_dyn
Hi,
I spent some time with oprofile, trying to figure out why we suck at the
gas_dyn benchmark in Polyhedron. It turns out that there are two lines
that together account for ~54% of the total runtime.
In subroutine CHOZDT we have the line
DTEMP = DX/(ABS(VEL) + SOUND)
and in subroutine EOS the line
CS(:NODES) = SQRT(CGAMMA*PRES(:NODES)/DENS(:NODES))
Both of these lines are array expressions, but they are quite simple and
gfortran manages to scalarize both of them without creating temporaries.
Both loops also vectorize nicely, which is important since gas_dyn is a
single-precision program, so vectorization is a real benefit on current
CPUs (vectorization alone reduces runtime from 30s to 24s on my Athlon 64).
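
To make "scalarize without creating temporaries" concrete, here is a
minimal sketch of the explicit loop that the CHOZDT array statement is
equivalent to. It is a hand-written illustration, not gfortran's actual
internal output; the module name scalarize_sketch and the procedure
name dtemp_loop are hypothetical and exist only for this example.

```fortran
! Sketch only: a hand-written loop equivalent to the array statement
!   DTEMP = DX/(ABS(VEL) + SOUND)
! Module and procedure names are hypothetical, chosen for this example.
module scalarize_sketch
  implicit none
contains
  subroutine dtemp_loop(nodes, vel, sound, dx, dtemp)
    integer, intent(in) :: nodes
    real, intent(in)    :: vel(nodes), sound(nodes), dx(nodes)
    real, intent(out)   :: dtemp(nodes)
    integer :: i
    ! One pass over the data, writing straight into DTEMP. No array
    ! temporary is needed because DTEMP does not appear on the
    ! right-hand side, so each element depends only on inputs.
    do i = 1, nodes
      dtemp(i) = dx(i)/(abs(vel(i)) + sound(i))
    end do
  end subroutine dtemp_loop
end module scalarize_sketch
```

Each iteration is independent of the others, which is what makes a loop
of this shape a candidate for vectorization.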
You can find both subroutines simplified, with comments showing the
oprofile data for the CPU_CLK_UNHALTED (basically, runtime) and
L2_CACHE_MISS events for the critical lines, attached. For ifort, I had
to disable -ipo to get any results for CHOZDT (probably inlined), but
without -ipo I didn't get sensible results for EOS (seems like the line
numbers got messed up somehow for opannotate), so the results are not
entirely comparable. Nonetheless, the ifort timings change only
marginally due to -ipo, so it shouldn't make a big difference.
Ifort, and judging by the published results other commercial compilers
too (I haven't tested them myself), still manage to beat gfortran quite
badly, see e.g.
http://www.polyhedron.com/
http://physik.fu-berlin.de/~tburnus/gcc-trunk/benchmark/
The reason, it seems, is that ifort (and presumably other commercial
compilers with competitive scores in gas_dyn) avoids calculating
divisions and square roots, replacing them with reciprocals and
reciprocal square roots. E.g. in EOS, sqrt(a/b) can be calculated as
1/sqrt(b*(1/a)), i.e. rsqrt(b*rcp(a)). This has a big impact on
performance, since the SSE instruction set contains very fast
approximate instructions for this: rcpps, rcpss, rsqrtps and rsqrtss
(PPC/AltiVec also has equivalent instructions). These instructions
have latencies of only a few cycles, vs. dozens of cycles or more for
normal division and square root. The price to be paid for this speed
is that these reciprocal instructions have an accuracy of only about
12 bits, so clearly they can be enabled only for -ffast-math. They
are also available only for single precision. I'll file a
missed-optimization PR about this.
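
To illustrate the rewrite for the EOS line, here is a minimal sketch
with a = CGAMMA*PRES and b = DENS. The module name recip_sketch and the
function names cs_direct and cs_recip are hypothetical, made up for this
example. Note that in plain Fortran both forms are still evaluated with
full-precision division and square root; the sketch only shows the shape
of the expression that maps onto a reciprocal and a reciprocal square
root. Whether a compiler actually emits the approximate instructions
depends on the compiler and its flags. It assumes a > 0 and b > 0.

```fortran
! Sketch only: the algebraic rewrite sqrt(a/b) = 1/sqrt(b*(1/a)).
! Module and function names are hypothetical, chosen for this example.
module recip_sketch
  implicit none
contains
  ! Reference form, as written in EOS: one division, one square root.
  elemental function cs_direct(cgamma, pres, dens) result(cs)
    real, intent(in) :: cgamma, pres, dens
    real :: cs
    cs = sqrt(cgamma*pres/dens)
  end function cs_direct

  ! Rewritten form with a = cgamma*pres and b = dens.
  elemental function cs_recip(cgamma, pres, dens) result(cs)
    real, intent(in) :: cgamma, pres, dens
    real :: cs, rcp_a
    rcp_a = 1.0/(cgamma*pres)   ! reciprocal: the rcpss/rcpps shape
    cs = 1.0/sqrt(dens*rcp_a)   ! reciprocal sqrt: the rsqrtss/rsqrtps shape
  end function cs_recip
end module recip_sketch
```

Both functions are elemental, so they can be applied to whole arrays,
e.g. CS(:NODES) = cs_recip(CGAMMA, PRES(:NODES), DENS(:NODES)). The two
forms agree up to rounding here; with the approximate instructions the
result would be good to only about 12 bits.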
--
Janne Blomqvist
SUBROUTINE CHOZDT(NODES, VEL, SOUND, DX, DT, STABF)
! *********************************
! CHOOSE TIME STEP
!
! STABF IS A STABILITY FACTOR
! *********************************
!...Translated by Pacific-Sierra Research VAST-90 1.02A2 13:53:43 1/12/93 -
IMPLICIT NONE
!-----------------------------------------------
! D u m m y A r g u m e n t s
!-----------------------------------------------
INTEGER NODES
REAL DT, STABF
REAL, DIMENSION(NODES) :: VEL, SOUND, DX
!-----------------------------------------------
! L o c a l P a r a m e t e r s
!-----------------------------------------------
INTEGER, PARAMETER :: IMAX = 50000
!-----------------------------------------------
! L o c a l V a r i a b l e s
!-----------------------------------------------
INTEGER :: ISET(1)
REAL :: VSET, SSET
REAL, DIMENSION (NODES) :: DTEMP
!-----------------------------------------------
! Profile for gfortran 4.3:
!   CPU_CLK_UNHALTED    L2_CACHE_MISS
!     samp  %runtime     samp     %tot
!    59887   22.4783     1484  10.9828 : DTEMP = DX/(ABS(VEL) + SOUND)
! Profile for ifort 9.1:
!    40104   16.2034     1198   8.8166 : DTEMP = DX/(ABS(VEL) + SOUND)
DTEMP = DX/(ABS(VEL) + SOUND)
ISET = MINLOC (DTEMP)
DT = DTEMP(ISET(1))
DT = STABF*DT
END SUBROUTINE CHOZDT
SUBROUTINE EOS(NODES, IENER, DENS, PRES, TEMP, GAMMA, CS, SHEAT, &
& CGAMMA, WT)
!
! EQUATION OF STATE
!
! INPUT:
!
! NODES INTEGER NUMBER OF CELLS IN MESH
! IENER REAL A. INTERNAL SPECIFIC ENERGY (J/KG)
! DENS REAL A. DENSITY (KG/M**3)
! SHEAT REAL CONSTANT SPECIFIC HEAT TO BE USED
! CGAMMA REAL CONSTANT GAMMA TO BE USED
!
! OUTPUT:
!
! PRES REAL A. PRESSURE (PASCALS)
! TEMP REAL A. TEMPERATURE (DEG K)
! GAMMA REAL A. THERMODYNAMIC GAMMA
! CS REAL A. SOUND SPEED (M/S)
!
! NOTE: THE ENTIRE MESH IS CALCULATED AT ONCE, SO THESE ARRAYS
! CONTAIN THE VARIABLES FOR EACH CELL
!
!...Translated by Pacific-Sierra Research VAST-90 1.02A2 13:53:43 1/12/93 -
IMPLICIT NONE
! INCLUDE 'tsolve.int'
!-----------------------------------------------
! D u m m y A r g u m e n t s
!-----------------------------------------------
INTEGER NODES
REAL SHEAT, CGAMMA, WT
REAL, DIMENSION(NODES) :: IENER, DENS, PRES, TEMP, GAMMA, CS
!-----------------------------------------------
! L o c a l P a r a m e t e r s
!-----------------------------------------------
REAL, PARAMETER :: RGAS = 8.314
!-----------------------------------------------
! L o c a l V a r i a b l e s
!-----------------------------------------------
REAL :: CONST
!-----------------------------------------------
! PARAMETER (RGAS=8.314)
!
! CONSTANT SPECIFIC HEAT AND GAMMA CALCULATIONS
!
! Profile data for gfortran 4.3:
!   CPU_CLK_UNHALTED    L2_CACHE_MISS
!     samp  %runtime     samp     %tot
!    31418   11.7926     4211  31.1649 : TEMP(:NODES) = IENER(:NODES)/SHEAT
!    35696   13.3983     2204  16.3114 : PRES(:NODES) = (CGAMMA - 1.0)*DENS(:NODES)*IENER(:NODES)
!    18228    6.8418     4550  33.6738 : GAMMA(:NODES) = CGAMMA
!    83894   31.4893      255   1.8872 : CS(:NODES) = SQRT(CGAMMA*PRES(:NODES)/DENS(:NODES))
! Profile data for ifort 9.1:
!    31740   12.8240    10900  19.7157 : TEMP(:NODES) = IENER(:NODES)/SHEAT
!    36477   14.7379     2939   5.3160 : PRES(:NODES) = (CGAMMA - 1.0)*DENS(:NODES)*IENER(:NODES)
!    17617    7.1179     4566   8.2589 : GAMMA(:NODES) = CGAMMA
!    40106   16.2042     1555   2.8126 : CS(:NODES) = SQRT(CGAMMA*PRES(:NODES)/DENS(:NODES))
! TEMP(:NODES) = IENER(:NODES)/SHEAT
! PRES(:NODES) = (CGAMMA - 1.0)*DENS(:NODES)*IENER(:NODES)
! GAMMA(:NODES) = CGAMMA
CS(:NODES) = SQRT(CGAMMA*PRES(:NODES)/DENS(:NODES))
END SUBROUTINE EOS