This is the mail archive of the fortran@gcc.gnu.org mailing list for the GNU Fortran project.


Performance analysis of Polyhedron/gas_dyn


Hi,

I spent some time with oprofile, trying to figure out why we suck at the
gas_dyn benchmark in Polyhedron. It turns out that there are two lines
that together account for ~54% of the total runtime.

In subroutine CHOZDT we have the line

DTEMP = DX/(ABS(VEL) + SOUND)

and in subroutine EOS the line

CS(:NODES) = SQRT(CGAMMA*PRES(:NODES)/DENS(:NODES))

Both of these lines are array expressions, but they are quite simple, and gfortran manages to scalarize both of them without creating temporaries. Both loops also vectorize nicely, which matters because gas_dyn is a single-precision program, so vectorization is a real benefit on current CPUs (vectorization alone reduces the runtime from 30 s to 24 s on my Athlon 64).

Both subroutines are attached in simplified form, with comments showing the oprofile data for the CPU_CLK_UNHALTED (basically, runtime) and L2_CACHE_MISS events for the critical lines. For ifort, I had to disable -ipo to get any results for CHOZDT (it was probably inlined), but without -ipo I didn't get sensible results for EOS (it seems the line numbers got messed up somehow for opannotate), so the results are not entirely comparable. Nonetheless, the ifort timings change only marginally with -ipo, so it shouldn't make a big difference.

Even so, ifort and other commercial compilers (I have only tested ifort myself) still manage to beat gfortran quite badly; see e.g.

http://www.polyhedron.com/

http://physik.fu-berlin.de/~tburnus/gcc-trunk/benchmark/

The reason, it seems, is that ifort (and presumably other commercial compilers with competitive scores in gas_dyn) avoids calculating full-precision divisions and square roots, replacing them with approximate reciprocals and reciprocal square roots. E.g. in EOS, sqrt(a/b) can be calculated as 1/sqrt(b*(1/a)), i.e. one reciprocal, one multiplication and one reciprocal square root. This has a big impact on performance, since the SSE instruction set contains very fast instructions for these operations: rcpps, rcpss, rsqrtps and rsqrtss (PPC/Altivec also has equivalent instructions). These instructions have latencies of only a few cycles, versus tens of cycles for normal division and square root. The price to be paid for this speed is that the reciprocal instructions have an accuracy of only about 12 bits, so clearly they can be enabled only under -ffast-math. They are also available only for single precision. I'll file a missed-optimization PR about this.

--
Janne Blomqvist

      SUBROUTINE CHOZDT(NODES, VEL, SOUND, DX, DT, STABF)
!                                      *********************************
!                                      CHOOSE TIME STEP
!
!                                      STABF IS A STABILITY FACTOR
!                                      *********************************
!...Translated by Pacific-Sierra Research VAST-90 1.02A2  13:53:43   1/12/93   -
      IMPLICIT NONE
!-----------------------------------------------
!   D u m m y   A r g u m e n t s
!-----------------------------------------------
      INTEGER NODES
      REAL DT, STABF
      REAL, DIMENSION(NODES) :: VEL, SOUND, DX
!-----------------------------------------------
!   L o c a l   P a r a m e t e r s
!-----------------------------------------------
      INTEGER, PARAMETER :: IMAX = 50000
!-----------------------------------------------
!   L o c a l   V a r i a b l e s
!-----------------------------------------------
      INTEGER :: ISET(1)
      REAL :: VSET, SSET
      REAL, DIMENSION (NODES) :: DTEMP
!-----------------------------------------------
! Profile for gfortran 4.3:
! CPU_CLK_UNHALTED L2_CACHE_MISS
! samp  %runtim  samp %tot
! 59887 22.4783  1484 10.9828   :      DTEMP = DX/(ABS(VEL) + SOUND)
! ifort 9.1 profile
! 40104 16.2034  1198  8.8166   :      DTEMP = DX/(ABS(VEL) + SOUND)

      DTEMP = DX/(ABS(VEL) + SOUND)
      ISET = MINLOC (DTEMP)
      DT = DTEMP(ISET(1))
      DT = STABF*DT
      END SUBROUTINE CHOZDT
       SUBROUTINE EOS(NODES, IENER, DENS, PRES, TEMP, GAMMA, CS, SHEAT,  &
     &    CGAMMA, WT)
!
!        EQUATION OF STATE
!
!        INPUT:
!
!        NODES     INTEGER     NUMBER OF CELLS IN MESH
!        IENER     REAL A.     INTERNAL SPECIFIC ENERGY (J/KG)
!        DENS      REAL A.     DENSITY (KG/M**3)
!        SHEAT     REAL        CONSTANT SPECIFIC HEAT TO BE USED
!        CGAMMA    REAL        CONSTANT GAMMA TO BE USED
!
!        OUTPUT:
!
!        PRES      REAL A.     PRESSURE (PASCALS)
!        TEMP      REAL A.     TEMPERATURE (DEG K)
!        GAMMA     REAL A.     THERMODYNAMIC GAMMA
!        CS        REAL A.     SOUND SPEED (M/S)
!
!        NOTE:  THE ENTIRE MESH IS CALCULATED AT ONCE, SO THESE ARRAYS
!               CONTAIN THE VARIABLES FOR EACH CELL
!
!...Translated by Pacific-Sierra Research VAST-90 1.02A2  13:53:43   1/12/93   -
      IMPLICIT NONE
!      INCLUDE 'tsolve.int'
!-----------------------------------------------
!   D u m m y   A r g u m e n t s
!-----------------------------------------------
      INTEGER NODES
      REAL SHEAT, CGAMMA, WT
      REAL, DIMENSION(NODES) :: IENER, DENS, PRES, TEMP, GAMMA, CS
!-----------------------------------------------
!   L o c a l   P a r a m e t e r s
!-----------------------------------------------
      REAL, PARAMETER :: RGAS = 8.314
!-----------------------------------------------
!   L o c a l   V a r i a b l e s
!-----------------------------------------------
      REAL :: CONST
!-----------------------------------------------
!     PARAMETER (RGAS=8.314)
!
!        CONSTANT SPECIFIC HEAT AND GAMMA CALCULATIONS
!
! Profile data for gfortran 4.3
! CPU_CLK_UNHALTED L2_CACHE_MISS
! samp  %runtim  samp %tot
! 31418 11.7926  4211 31.1649   :          TEMP(:NODES) = IENER(:NODES)/SHEAT
! 35696 13.3983  2204 16.3114   :          PRES(:NODES) = (CGAMMA - 1.0)*DENS(:NODES)*IENER(:NODES)
! 18228  6.8418  4550 33.6738   :          GAMMA(:NODES) = CGAMMA
! 83894 31.4893   255  1.8872   :          CS(:NODES) = SQRT(CGAMMA*PRES(:NODES)/DENS(:NODES))
! Profile data for ifort 9.1
! 31740 12.8240 10900 19.7157   :          TEMP(:NODES) = IENER(:NODES)/SHEAT
! 36477 14.7379  2939  5.3160   :          PRES(:NODES) = (CGAMMA - 1.0)*DENS(:NODES)*IENER(:NODES)
! 17617  7.1179  4566  8.2589   :          GAMMA(:NODES) = CGAMMA
! 40106 16.2042  1555  2.8126   :          CS(:NODES) = SQRT(CGAMMA*PRES(:NODES)/DENS(:NODES))

!      TEMP(:NODES) = IENER(:NODES)/SHEAT
!      PRES(:NODES) = (CGAMMA - 1.0)*DENS(:NODES)*IENER(:NODES)
!      GAMMA(:NODES) = CGAMMA
      CS(:NODES) = SQRT(CGAMMA*PRES(:NODES)/DENS(:NODES))

      END SUBROUTINE EOS
