This is the mail archive of the mailing list for the GCC project.

Call for testers: SH optimized software floating point

In 2004, I worked on optimized software floating point for the SH4:
Rakesh Kumar posted an SH assembly software floating point implementation
that also supported the SH2 and was closer to completion, but with
lower performance:

After a longer hiatus, I've now combined these code bases.

For SH1/SH2 support, I've used my own code for comparisons and
single / double precision conversions, and Rakesh Kumar's and
Aanchal Khanna's code as a basis for the arithmetic and integer/floating
point conversions.  After checking with the Copyright Clerk that a suitable
assignment was on file, I've added Copyright headers, changed generated
denormals to be consistent with what the comparison code expects, and added
support for SH1 (which has neither delayed branches nor 32*32 bit multiply).

For SH3 and SH4, I've used my own code, some from 2004, and some which I've
written now.  The code I wrote in 2004 is scheduled primarily for the
SH4-200, with some considerations for earlier processors where it didn't
hurt the SH4.
The newer code is scheduled primarily for the ST40-300, while some concessions
have been made for the SH4-200 (e.g. using extra pc-relative constants to
reduce EX group pressure).
Speed is generally favoured over size, particularly for normalized number
handling, but to some extent also for denormalized numbers.  I.e. there
are very few loops, no inter-module calls (which couldn't be guaranteed
to be in bsr range), and I've added alignment instructions to help
The downside of this is that the code is somewhat larger than it would be
otherwise, and it is extremely hard to get path or even code coverage
for all the code when testing it.
divsf3 uses the div1 instruction for the fraction computation; extracting
quotient bytes when they are ready while feeding in new dividend bytes
obviates the need for an extra shift register, and the other pipeline
is kept busy working on the exponent and flagging cases where the input
is not finite or the output not normalized - these flags are then checked
simultaneously at the end using cmp/str.
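
The bit-serial idea behind the divsf3 description can be sketched in C.
This is only an illustration under stated assumptions, not the code in
divsf3.S: frac_div and cmp_str_any are hypothetical names, the loop is a
plain restoring shift-and-subtract division (the real div1 instruction
performs a non-restoring step), and the flag layout in the closing comment
is a guess at how a single cmp/str could test several conditions at once.

```c
#include <assert.h>
#include <stdint.h>

/* Restoring division producing one quotient bit per iteration, which is
 * the granularity at which a div1-style step delivers results.
 * Requires num < 2*den and den < 2^31 so nothing overflows.
 * Returns floor(num * 2^(bits-1) / den). */
static uint32_t frac_div(uint32_t num, uint32_t den, int bits)
{
    assert(den != 0 && num / 2 < den);
    assert(bits >= 1 && bits <= 32);

    uint32_t rem = num;
    uint32_t quo = 0;
    for (int i = 0; i < bits; i++) {
        quo <<= 1;
        if (rem >= den) {       /* trial subtraction succeeds */
            rem -= den;
            quo |= 1;
        }
        rem <<= 1;              /* move on to the next lower quotient bit */
    }
    return quo;
}

/* Model of the SH cmp/str result: 1 if any byte of rn equals the byte in
 * the same position of rm, 0 otherwise. */
static int cmp_str_any(uint32_t rm, uint32_t rn)
{
    uint32_t x = rm ^ rn;       /* equal bytes become zero bytes */
    return (x & 0xff000000u) == 0 || (x & 0x00ff0000u) == 0 ||
           (x & 0x0000ff00u) == 0 || (x & 0x000000ffu) == 0;
}

/* Assumed flag scheme: if every special case (non-finite input,
 * unnormalized output) is arranged to leave a zero byte in one register,
 * then cmp_str_any(flags, 0) detects all of them with a single test. */
```

For two 24-bit significands, frac_div(a, b, 25) yields the quotient
scaled by 2^24, leaving bits below the 24-bit result for rounding.
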
divdf3, on the other hand, uses a numerical algorithm.  Each step relies on
the previous steps to contain the error in a certain interval not only
to keep the output error interval in check, but also so that the topmost
bits of the calculated defect are known to be only sign extensions.
The implementation that is actually used in divdf3.S can have two
different run times for normalized numbers, depending on whether the result
from the penultimate step is found to be sufficient to calculate a
correctly rounded result; this helps keep the average computing time down,
to about 66 cycles for the ST40-300.  If you are more interested in keeping
worst-case times down, you might consider finishing divdf-rt.S; this should
be able to operate on normalized numbers in something like 71 or 72 cycles.
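
The message does not say which numerical algorithm divdf3 uses.  As a
sketch of the general idea only, the following C code assumes a
Newton-Raphson reciprocal iteration in Q30 fixed point; recip_q30, the
linear initial estimate and the step count are all hypothetical choices
and are not taken from divdf3.S.  It shows why a bounded error matters:
once the previous estimate is known to be close, the defect 1 - d*x is
small, so its upper bits carry nothing but sign extension.

```c
#include <assert.h>
#include <stdint.h>

#define Q30_ONE ((int64_t)1 << 30)   /* 1.0 in Q30 fixed point */

/* Approximate 1/d for d in [1.0, 2.0), with d and the result in Q30.
 * Each step computes the defect e = 1 - d*x and refines x' = x + x*e,
 * which roughly squares the relative error. */
static int64_t recip_q30(int64_t d)
{
    assert(d >= Q30_ONE && d < 2 * Q30_ONE);

    /* Assumed starting estimate x0 = 1.5 - d/2, exact at d = 1 and at
     * d = 2, with |defect| <= 1/8 in between. */
    int64_t x = (3 * Q30_ONE) / 2 - d / 2;

    for (int step = 0; step < 4; step++) {
        /* The defect shrinks from about 2^-3 to 2^-6, 2^-12, 2^-24, so
         * its high bits are known in advance to be copies of the sign. */
        int64_t defect = Q30_ONE - (d * x) / Q30_ONE;
        x += (x * defect) / Q30_ONE;
    }
    return x;
}
```

A division a/b would then multiply a by the reciprocal of b and apply a
final correction to obtain a correctly rounded result; that last part is
omitted here.
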

One known issue triggered by the sh.c / code to generate the calls
to the library is PR rtl-optimization/28618; I have no proper solution
for this yet, the patch I posted earlier caused other regressions.
A possible workaround is to compile with -fno-schedule-insns.
(N.B., running sched2 should be fine, the problem is the scheduling
 pass before register allocation which overcommits r0).

I've also attached a small patch to fix a problem with the m2e / m3e /
m4-nofpu multilib variants of the libgcc2 DImode <-> DFmode
conversion functions.

Attachment: softfp-20060902-1927.gz
Description: GNU Zip compressed data

Attachment: sh-2e-doublefix
Description: Binary data
