This is the mail archive of the fortran@gcc.gnu.org mailing list for the GNU Fortran project.



Re: Performance analysis of Polyhedron/gas_dyn


On 4/27/07, Janne Blomqvist <blomqvist.janne@gmail.com> wrote:
Hi,

I spent some time with oprofile, trying to figure out why we suck at the
gas_dyn benchmark in polyhedron. It turns out that there are two lines
that account for ~54% of the total runtime.

In subroutine CHOZDT we have the line

DTEMP = DX/(ABS(VEL) + SOUND)

and in subroutine EOS the line

CS(:NODES) = SQRT(CGAMMA*PRES(:NODES)/DENS(:NODES))

See also http://www.suse.de/~gcctest/c++bench/polyhedron/analysis.html (same conclusion for gas_dyn).

Both of these lines are array expressions, but they are quite simple and
gfortran manages to scalarize both of them without creating temporaries.
Both loops also vectorize nicely, which is important since gas_dyn is a
single-precision program, so vectorization is a real benefit on current
CPUs (vectorization alone reduces runtime from 30s to 24s on my Athlon 64).

You can find both subroutines simplified, with comments showing the
oprofile data for the CPU_CLK_UNHALTED (basically, runtime) and
L2_CACHE_MISS events for the critical lines, attached. For ifort, I had
to disable -ipo to get any results for CHOZDT (probably inlined), but
without -ipo I didn't get sensible results for EOS (seems like the line
numbers got messed up somehow for opannotate), so the results are not
entirely comparable. Nonetheless, the ifort timings change only
marginally due to -ipo, so it shouldn't make a big difference.

Ifort and other commercial compilers (I haven't tested others) still
manage to beat gfortran quite badly, see e.g.

http://www.polyhedron.com/

http://physik.fu-berlin.de/~tburnus/gcc-trunk/benchmark/

The reason, it seems, is that ifort (and presumably other commercial
compilers with competitive scores in gas_dyn) avoids calculating
divisions and square roots, replacing them with reciprocals and
reciprocal square roots. E.g. in EOS sqrt(a/b) can be calculated as
1/sqrt(b*(1/a)). This has a big impact on performance, since the SSE
instruction set contains very fast instructions for this, rcpps, rcpss,
rsqrtps, rsqrtss (PPC/Altivec also has equivalent instructions). These
instructions have latencies of 1-2 cycles vs. dozens or even hundreds of
cycles for normal division and square root.  The price to be paid for
this speed is that these reciprocal instructions have an accuracy of
only 12 bits, so clearly they can be enabled only for -ffast-math. And
they are available only for single precision. I'll file a
missed-optimization PR about this.

I think that even with -ffast-math, 12 bits of accuracy is not ok. There is the possibility of doing another Newton iteration step to improve accuracy; that would be ok for -ffast-math. We can, though, add an extra flag, -msserecip or however you'd call it, to enable use of the instructions with less accuracy.

Richard.

