This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: Performance analysis of Polyhedron/gas_dyn


Richard Guenther wrote:
> See also http://www.suse.de/~gcctest/c++bench/polyhedron/analysis.html
> (same conclusion for gas_dyn).

Thanks, I seem to have completely missed that page (though I was aware of your polyhedron tester).


> On 4/27/07, Janne Blomqvist <blomqvist.janne@gmail.com> wrote:
>> The reason, it seems, is that ifort (and presumably other commercial
>> compilers with competitive scores in gas_dyn) avoids calculating
>> divisions and square roots, replacing them with reciprocals and
>> reciprocal square roots. E.g. in EOS sqrt(a/b) can be calculated as
>> 1/sqrt(b*(1/a)). This has a big impact on performance, since the SSE
>> instruction set contains very fast instructions for this, rcpps, rcpss,
>> rsqrtps, rsqrtss (PPC/Altivec also has equivalent instructions). These
>> instructions have latencies of 1-2 cycles vs. dozens or even hundreds of
>> cycles for normal division and square root. The price to be paid for
>> this speed is that these reciprocal instructions have an accuracy of
>> only 12 bits, so clearly they can be enabled only for -ffast-math. And
>> they are available only for single precision. I'll file a
>> missed-optimization PR about this.

> I think that even with -ffast-math, 12 bits of accuracy is not OK.
> It is possible to do one more Newton iteration step to improve the
> accuracy; that would be OK for -ffast-math. We could, though, add an
> extra flag (-msserecip, or whatever you'd call it) to enable use of
> the instructions with less accuracy.
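
The Newton step mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the starting estimate comes from the rsqrtss intrinsic when SSE is available, and on other targets a full-precision fallback is assumed so that the code still compiles. One step of y' = y * (1.5 - 0.5 * x * y * y) roughly doubles the number of correct bits, taking a 12-bit estimate close to full single precision.

```c
#include <assert.h>
#include <math.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Coarse estimate of 1/sqrt(x); about 12 bits when rsqrtss is used. */
float rsqrt_estimate(float x)
{
    assert(x > 0.0f);
#if defined(__SSE__)
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
#else
    /* Assumed fallback for non-SSE targets: full-precision value. */
    return 1.0f / sqrtf(x);
#endif
}

/* One Newton-Raphson step on f(y) = 1/(y*y) - x, starting from the
 * coarse estimate: y_next = y * (1.5 - 0.5 * x * y * y). */
float rsqrt_refined(float x)
{
    float y = rsqrt_estimate(x);
    return y * (1.5f - 0.5f * x * y * y);
}
```

The refinement costs a few extra multiplies and one subtraction, with no division or square root.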

I agree it can be an issue, but OTOH people who care about precision probably (1) avoid -ffast-math and (2) use double precision (where these reciprocal instructions are not available). Intel calls the corresponding option -no-prec-div, but it's enabled by the "-fast" catch-all option.


On a related note, our beloved competitors generally have some high-level flag that combines all these fancy and potentially unsafe optimizations (e.g. -O4, -fast, -fastsse, -Ofast, etc.). For gcc, FP benchmarks at least seem to do well in general with something like "-O3 -funroll-loops -ftree-vectorize -ffast-math -march=native -mfpmath=sse", but it's quite a mouthful.

--
Janne Blomqvist

