This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.

Re: RFC: ARM Cortex-A8 and floating point performance

From: Andrew Pinski <pinskia at gmail dot com>
To: Richard Guenther <richard dot guenther at gmail dot com>
Cc: Siarhei Siamashka <siarhei dot siamashka at gmail dot com>, "gcc at gcc dot gnu dot org" <gcc at gcc dot gnu dot org>
Date: Wed, 16 Jun 2010 08:09:09 -0700
Subject: Re: RFC: ARM Cortex-A8 and floating point performance
References: <201006161552.43706.siarhei.siamashka@gmail.com> <AANLkTilAcwQpnZLSL9AQyBWCpSgg4AXD9t8vmeLHrOlh@mail.gmail.com>

Sent from my iPhone

On Jun 16, 2010, at 6:04 AM, Richard Guenther <richard.guenther@gmail.com> wrote:

On Wed, Jun 16, 2010 at 5:52 PM, Siarhei Siamashka
<siarhei.siamashka@gmail.com> wrote:
Hello,

Currently gcc (at least version 4.5.0) does a very poor job generating single precision floating point code for ARM Cortex-A8.

The source of this problem is the use of VFP instructions, which run on the slow, non-pipelined VFP Lite unit in Cortex-A8. Even turning on RunFast mode (flush denormals to zero, disable exceptions) provides only a relatively minor performance gain.

The right solution seems to be the use of NEON instructions for doing most of the single precision calculations.

I wonder if it would be difficult to introduce the following changes to the gcc-generated code when optimizing for Cortex-A8:

1. Allocate single precision variables only to evenly or oddly numbered s-registers.
2. Instead of using 'fadds s0, s0, s2' or similar instructions, use 'vadd.f32 d0, d0, d1'.
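As a rough illustration of the proposed substitution, consider a trivial single precision function. The function name `add_f32` is hypothetical, and the assembly in the comments is only what one would expect in each case, not actual compiler output. The two instructions are the ones quoted above; s0 is the low half of d0 and s2 is the low half of d1, which is why keeping values in even-numbered s-registers lines them up with the low lane of a d-register.

```c
/* Hypothetical example function; the name is for illustration only. */
float add_f32(float a, float b)
{
    /*
     * VFP code generation would be expected to use something like:
     *     fadds    s0, s0, s2
     *
     * The proposal is to use the NEON form on Cortex-A8 instead:
     *     vadd.f32 d0, d0, d1
     *
     * Each d-register holds two 32-bit lanes, but only one lane
     * carries a live value here, so the usable single precision
     * register count is effectively halved.
     */
    return a + b;
}
```

Calling `add_f32(1.5f, 2.25f)` returns 3.75 either way; only the instruction selection would differ.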

The number of single precision floating point registers gets effectively halved this way. Supporting '-mfloat-abi=hard' may be a bit tricky (packing/unpacking of register pairs may be needed to ensure proper parameter passing to functions). Also there may be other problems, like dealing with strict IEEE-754 compliance (maybe a special variable attribute for relaxing compliance requirements could be useful). But this looks like the only solution to fix poor performance on the ARM Cortex-A8 processor.

Actually clang 2.7 seems to be working exactly this way. And it is outperforming gcc 4.5.0 by up to a factor of 2 or 3 on some single precision floating point tests that I tried on ARM Cortex-A8.

On i?86 we have -mfpmath={sse,x87}; I suppose you could add
-mfpmath=neon for ARM (properly conflicting with -mfloat-abi=hard
and requiring NEON support).

Except that, unlike SSE, NEON does not fully support IEEE 754. So this should only be done with -ffast-math :). The fact that the current code is slow is not a good enough reason to replace it with something that is fast but wrong.
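One concrete example of the compliance gap is denormal handling: NEON arithmetic on ARMv7 flushes denormals to zero, which is the same relaxation the RunFast mode mentioned earlier enables for VFP. The sketch below emulates that behaviour in portable C so the difference can be seen on any IEEE 754 host. The helper names `flush_to_zero`, `halve_ieee`, and `halve_ftz` are hypothetical, and the emulation covers only denormal flushing, not any other difference between the two units.

```c
#include <float.h>

/* Hypothetical helper: emulates flush-to-zero in software by
   replacing any subnormal value with a zero of the same sign. */
float flush_to_zero(float x)
{
    if (x != 0.0f && x > -FLT_MIN && x < FLT_MIN)
        return (x < 0.0f) ? -0.0f : 0.0f;
    return x;
}

/* IEEE 754 behaviour: halving FLT_MIN yields a nonzero subnormal. */
float halve_ieee(float x)
{
    return x / 2.0f;
}

/* Emulated flush-to-zero behaviour: subnormal inputs and results
   are both replaced by zero. */
float halve_ftz(float x)
{
    return flush_to_zero(flush_to_zero(x) / 2.0f);
}
```

With these helpers, `halve_ieee(FLT_MIN)` is a small positive number while `halve_ftz(FLT_MIN)` is exactly zero, so code that depends on gradual underflow would see different results.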

Richard.

--
Best regards,
Siarhei Siamashka

Follow-Ups:
- Re: RFC: ARM Cortex-A8 and floating point performance
  - From: David Brown

References:
- RFC: ARM Cortex-A8 and floating point performance
  - From: Siarhei Siamashka
- Re: RFC: ARM Cortex-A8 and floating point performance
  - From: Richard Guenther
