This is the mail archive of the
mailing list for the GCC project.
Re: [Info], Add support for PowerPC IEEE 128-bit floating point
- From: Michael Meissner <meissner at linux dot vnet dot ibm dot com>
- To: Segher Boessenkool <segher at kernel dot crashing dot org>
- Cc: Michael Meissner <meissner at linux dot vnet dot ibm dot com>, gcc-patches at gcc dot gnu dot org, dje dot gcc at gmail dot com
- Date: Wed, 16 Jul 2014 11:44:14 -0400
- Subject: Re: [Info], Add support for PowerPC IEEE 128-bit floating point
- Authentication-results: sourceware.org; auth=none
- References: <20140715182354 dot GA21906 at ibm-tiger dot the-meissners dot org> <20140715212031 dot GA11448 at ibm-tiger dot the-meissners dot org> <20140715215033 dot GA671 at gate dot crashing dot org>
On Tue, Jul 15, 2014 at 04:50:33PM -0500, Segher Boessenkool wrote:
> On Tue, Jul 15, 2014 at 05:20:31PM -0400, Michael Meissner wrote:
> > I did some timing tests to compare the new PowerPC IEEE 128-bit results to the
> > current implementation of long double using the IBM extended format.
> > The test consisted of a short loop doing the operation over arrays of 1,024
> > elements, reading in two values, doing the operation, and then storing it back.
> > This loop in turn was done multiple times, with the idea that most of the
> > values would be in the cache, and we didn't have to worry about pre-fetching,
> > etc.
> > The float and double tests were done with vectorization disabled, while for
> > the vector float and vector double tests the compiler was allowed to do the
> > normal auto vectorization.
> > The number reported was how much longer the second column took over the first:
> I assume you mean the other way around?
> > Generally, the __float128 is 2x slower than the current IBM extended double
> > format, except for divide, where it is 5x slower. I must say, the software
> > floating point emulation routines worked well, and once the proper macros were
> > set up, I only needed to override the type used for IEEE 128-bit.
> > Add loop
> > ========
> > float vs double: 2.00x
> Why is float twice as slow as double?
I'm not sure. Book IV claims that lfsu vs. lfdu, fadds vs. fadd, and stfsu vs.
stfdu each take the same number of cycles. The inner loops are:
;; Float loop
;; Double loop
I suspect the slowdown in a tight loop comes from the PowerPC keeping scalar
floating point in double format internally in the registers (i.e. lfsu must
load the value and convert it to double, fadds must do the add and then round
to float precision, and stfsu must convert the double back to float format).
We've also seen cases where load with update is slower than doing the load and
the address update as separate instructions.
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: email@example.com, phone: +1 (978) 899-4797