This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: RFC: ARM Cortex-A8 and floating point performance
On Wednesday 16 June 2010 15:22:32 Ramana Radhakrishnan wrote:
> On Wed, 2010-06-16 at 15:52 +0000, Siarhei Siamashka wrote:
> > Currently gcc (at least version 4.5.0) does a very poor job generating
> > single precision floating point code for ARM Cortex-A8.
> >
> > The source of this problem is the use of VFP instructions which are run
> > on a slow nonpipelined VFP Lite unit in Cortex-A8. Even turning on
> > RunFast mode (flush denormals to zero, disable exceptions) just provides
> > a relatively minor performance gain.
> >
> > The right solution seems to be the use of NEON instructions for doing
> > most of the single precision calculations.
>
> Only in situations that the user is aware about -ffast-math. I will
> point out that single precision floating point operations on NEON are
> not completely IEEE compliant.
Sure. The way gcc deals with IEEE compliance in the generated code should
preferably be consistent and clearly defined. That's why I reported the
following problem earlier: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43703
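To make the compliance gap a bit more concrete: one of the differences mentioned above is that denormals get flushed to zero. Below is a rough software model of just that one behaviour. The helper name flush_to_zero is made up for illustration, the sign of the resulting zero is not modelled, and this is not a description of everything the NEON unit does differently.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: a simplified software model of flush-to-zero.
// Any subnormal (denormal) value is replaced by zero; everything else,
// including NaN and infinity, passes through unchanged.
// Assumption: the sign of the flushed zero is ignored in this sketch.
inline float flush_to_zero(float x) {
    return std::fpclassify(x) == FP_SUBNORMAL ? 0.0f : x;
}

// The smallest positive subnormal float, which strict IEEE arithmetic
// must keep distinct from zero.
inline float smallest_denormal() {
    return std::numeric_limits<float>::denorm_min();
}
```

Code that relies on tiny values such as smallest_denormal() staying nonzero is the kind of code that cannot silently be moved to a flush-to-zero unit.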
Generating fast floating point code for Cortex-A8 with the -ffast-math option
can be a good starting point. Ideally, it would also be possible to mix IEEE
compliant and non-IEEE compliant parts of code. Supporting something like this
would be handy:
typedef __attribute__((ieee_noncompliant)) float fast_float;
For example, Qt defines its own 'qreal' type, which currently defaults to
'float' for ARM and to 'double' for all the other architectures. Many
applications are not that sensitive to strict IEEE compliance or even
precision. But some applications and libraries are, so they need to be
respected too.
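The per-type attribute above does not exist. The closest thing I know of in current gcc is the per-function 'optimize' attribute, so a rough approximation of mixing relaxed and strict code could look like the sketch below. The macro FAST_MATH_FN and both function names are made up for illustration. The gcc documentation describes this attribute as meant for debugging rather than production use, and whether it makes the ARM back end pick NEON instructions is a separate question. The sketch only shows how the scoping could work.

```cpp
// Assumption: GCC's function-level `optimize` attribute is available.
// Other compilers get an empty macro, so both functions build the same way.
#if defined(__GNUC__) && !defined(__clang__)
#define FAST_MATH_FN __attribute__((optimize("fast-math")))
#else
#define FAST_MATH_FN
#endif

// Relaxed version: the compiler may reassociate and otherwise bend IEEE rules.
FAST_MATH_FN float sum_relaxed(const float *v, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += v[i];
    return acc;
}

// Strict version: compiled with whatever the command line options say.
float sum_strict(const float *v, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += v[i];
    return acc;
}
```

This is per-function rather than per-type, so it does not help a library like Qt that wants to express the choice through a type such as 'qreal'.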
But in any case, ARM Cortex-A8 has some hardware to do reasonably fast single
precision floating point calculations (with some compliance issues). It makes
a lot of sense to be able to utilize this hardware efficiently from a high
level language such as C/C++ without rewriting tons of existing code.
AFAIK x86 had its own bunch of issues with the 80-bit extended precision when
only 32-bit or 64-bit precision is needed.
By the way, I experimented with working around this floating point
performance issue by making a C++ wrapper class, overloading operators
and using NEON intrinsics. It provided a nice speedup in some cases. But gcc
still has trouble generating efficient code for NEON intrinsics, and there
were other issues, such as the size of this newly defined type, which make it
not very practical overall.
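A minimal sketch of what such a wrapper could look like is below. This is a reconstruction for illustration, not the exact code I experimented with. The class name fast_float is borrowed from the typedef earlier in this mail, and dot3 is a made-up usage example. It assumes each value is kept in a 64-bit float32x2_t with only one lane used, and it falls back to plain float when NEON is not available so that it still builds elsewhere.

```cpp
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define FAST_FLOAT_USE_NEON 1
#else
#define FAST_FLOAT_USE_NEON 0
#endif

// Hypothetical wrapper type: scalar float semantics, NEON arithmetic.
class fast_float {
#if FAST_FLOAT_USE_NEON
    typedef float32x2_t storage;  // 8 bytes: one lane used, one wasted
    static storage load(float x)             { return vdup_n_f32(x); }
    static float   store(storage v)          { return vget_lane_f32(v, 0); }
    static storage add(storage a, storage b) { return vadd_f32(a, b); }
    static storage sub(storage a, storage b) { return vsub_f32(a, b); }
    static storage mul(storage a, storage b) { return vmul_f32(a, b); }
#else
    typedef float storage;        // scalar fallback for non-NEON targets
    static storage load(float x)             { return x; }
    static float   store(storage v)          { return v; }
    static storage add(storage a, storage b) { return a + b; }
    static storage sub(storage a, storage b) { return a - b; }
    static storage mul(storage a, storage b) { return a * b; }
#endif
    struct raw_tag {};            // selects the private "already packed" constructor
    storage v_;
    fast_float(storage v, raw_tag) : v_(v) {}

public:
    fast_float(float x = 0.0f) : v_(load(x)) {}

    // Explicit accessor instead of a conversion operator, which would make
    // expressions like `a + 1.0f` ambiguous.
    float value() const { return store(v_); }

    friend fast_float operator+(fast_float a, fast_float b) {
        return fast_float(add(a.v_, b.v_), raw_tag());
    }
    friend fast_float operator-(fast_float a, fast_float b) {
        return fast_float(sub(a.v_, b.v_), raw_tag());
    }
    friend fast_float operator*(fast_float a, fast_float b) {
        return fast_float(mul(a.v_, b.v_), raw_tag());
    }
};

// Usage example: a 3-element dot product written against the wrapper.
inline float dot3(const float *a, const float *b) {
    fast_float acc = 0.0f;
    for (int i = 0; i < 3; ++i)
        acc = acc + fast_float(a[i]) * fast_float(b[i]);
    return acc.value();
}
```

With the NEON path, sizeof(fast_float) is 8 instead of 4, which is the size issue mentioned above: arrays and structs of such values double in size and no longer match the layout of existing 'float' data.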
--
Best regards,
Siarhei Siamashka