This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: RFC: Handling of libgcc symbols in SH shared libraries

From: Joern Rennecke <joern dot rennecke at superh dot com>
To: kumar107 at rediffmail dot com
Cc: joern dot rennecke at superh dot com (Joern Rennecke), gcc at gcc dot gnu dot org, nickc at redhat dot com
Date: Thu, 5 Aug 2004 19:35:51 +0100 (BST)
Subject: Re: RFC: Handling of libgcc symbols in SH shared libraries
>  This is a multipart mime message
> 
> 
> --Next_1091696227---0-202.54.124.178-20027
> Content-type: text/html;
> 	charset=iso-8859-1
> Content-Transfer-Encoding: quoted-printable
> Content-Disposition: inline

This encoding makes your messages harder to reply to.  It also seems to
cause the mailing list software to discard them.

> Hi Joern,
>
> Thanks for your valuable comments.
>
> > You have a number of loops that slow down the code to the point that
> > I can't tell off-hand if it will be faster than the fp-bit.c
> > implementation.  Just as a rough guess, your addsf3 and mulsf3
> > implementations might take some four times as much as mine on an SH4.
>
> It would be better than the fp-bit.c implementation.
> Anyway, the point about loops in mulsf3 is valid.
> I can see the point you highlighted.
>
> I can fix the problems you highlighted and it should be fine.
> Since you are already done with some routines (single precision
> arithmetic), and since they are apparently better, I think I should
> concentrate on the remaining routines.  Please note that the routines are
> functionally correct, so it does reduce effort as compared to
> re-writing from scratch.

That's a non sequitur.  To write an optimized assembler function, you have
to consider the different algorithms available, and for all the likely
candidates, see how the needed computations and data flow requirements
constrain the scheduling of the 'hot' paths through the code, and do your
instruction selection and scheduling around these constraints.
When you have enough information to definitely say that one version is
inferior, you can drop it.
E.g. for muldf3, you can do a 64 bit multiply by adding four partial
products together in a straightforward manner.  Or you can do a 62 bit
multiply with three partial products, exploiting the equation
(a0-a1) * (b1-b0) = a0b1+a1b0-a0b0-a1b1.  Or even 56 bits are enough,
using (a0+a1) * (b0+b1) = a0b0+a1b1+a0b1+a1b0:


/* This is just an untested draft without full register allocation.  */
mov.l LOCAL(x00100000),t0
mov #-24,t1
mov.l LOCAL(x7fe00000),t2

mov.l a0,@-r15
add t0,a0
mov.l b0,@-r15
add t0,b0
mov.l LOCAL(x001fffff),t3
tst t2,a0
or t0,a0
bt LOCAL(inf_nan_denorm_or_zero_a)
and t3,a0
tst t2,b0
or t0,b0
bt LOCAL(inf_nan_denorm_or_zero_b)
and t3,b0
mov a1,t0
shld t1,t0
mov.l LOCAL(xffffff),t2
shll8 a0

or t0,a0
mov b1,t0
shld t1,t0

shll8 b0

or t0,b0

and t2,a1

and t2,b1

dmulu.l a1,b1
add b0,b1

add a0,a1

sts mach,r2
sts macl,r3
dmulu.l a1,b1
mov r3,t1
xtrct r2,t1
shlr8 t1
clrt
sts macl,t3
sts mach,t2
dmulu.l a0,b0
subc r3,t3

subc r2,t2 ! clears T
mov.w LOCAL(d0),t0
sts macl,r1
sts mach,r0
addc t1,t3

addc t0,t2 ! clears T
mov.w LOCAL(m24),t4
subc r1,t3

subc r0,t2 ! clears T

or t3,r3

shld t4,t3

shll8 t2

or t2,t3

addc t3,r1

addc t0,r0

/* 56 bit Fraction is now in r0 / r1, with bits 0..23 of r3 holding
   extra sticky bits. */
I've left a blank line here after every unpaired instruction.  Some of these
probably would have to be filled with register saves / restores for proper
register allocation.  Some of it might also be used to process the exponent.
There are no latency-induced stalls in this code.  Your code is riddled with
such stalls, because you didn't plan the data flow thoroughly enough.
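
To make the three-partial-product identity concrete, here is a minimal C
sketch.  It is not a translation of the SH4 draft; it only assumes the same
split of a 53 bit fraction into a 29 bit high part a0 and a 24 bit low part
a1, so that a0+a1 still fits in 32 bits.  The names u128, add_shifted,
mul_three_products and mul_four_products are made up for this example.

```c
#include <stdint.h>

/* Hypothetical 128 bit accumulator for this example; not a libgcc type. */
typedef struct { uint64_t hi, lo; } u128;

#define LOW_BITS 24
#define LOW_MASK ((UINT64_C(1) << LOW_BITS) - 1)

/* acc += v << sh, for 0 < sh < 64, carrying from acc.lo into acc.hi. */
static u128 add_shifted(u128 acc, uint64_t v, unsigned sh)
{
    uint64_t lo = v << sh;
    acc.lo += lo;
    acc.hi += (v >> (64 - sh)) + (acc.lo < lo);
    return acc;
}

/* Three 32x32->64 products, using
   (a0+a1) * (b0+b1) = a0b0 + a1b1 + a0b1 + a1b0.
   Assumes a and b are at most 53 bits wide.  */
static u128 mul_three_products(uint64_t a, uint64_t b)
{
    uint32_t a0 = (uint32_t)(a >> LOW_BITS), a1 = (uint32_t)(a & LOW_MASK);
    uint32_t b0 = (uint32_t)(b >> LOW_BITS), b1 = (uint32_t)(b & LOW_MASK);
    uint64_t high = (uint64_t)a0 * b0;
    uint64_t low  = (uint64_t)a1 * b1;
    uint64_t sum  = (uint64_t)(a0 + a1) * (b0 + b1);
    uint64_t mid  = sum - high - low;        /* a0b1 + a1b0 */
    u128 r = { 0, low };
    r = add_shifted(r, mid, LOW_BITS);
    r = add_shifted(r, high, 2 * LOW_BITS);
    return r;
}

/* Straightforward reference with four partial products.  */
static u128 mul_four_products(uint64_t a, uint64_t b)
{
    uint32_t a0 = (uint32_t)(a >> LOW_BITS), a1 = (uint32_t)(a & LOW_MASK);
    uint32_t b0 = (uint32_t)(b >> LOW_BITS), b1 = (uint32_t)(b & LOW_MASK);
    uint64_t mid = (uint64_t)a0 * b1 + (uint64_t)a1 * b0;
    u128 r = { 0, (uint64_t)a1 * b1 };
    r = add_shifted(r, mid, LOW_BITS);
    r = add_shifted(r, (uint64_t)a0 * b0, 2 * LOW_BITS);
    return r;
}
```

The subtraction that recovers the middle term is what the subc pairs in the
draft are for; whether three multiplies actually beat four depends on how
well the extra adds and subtracts schedule around the multiplier latency.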

With numerical codes, there are also often opportunities for shortcuts when
you can use an imprecise intermediate result to arrive at the exact end
result.

Also, since your code did not make it to the mailing list, it is not clear
if it has actually been contributed, so there are open questions about
assigning copyright to the FSF.  And it doesn't support SH1, anyway.

> On my end, I can additionally test your work.  Please let me know
> if that is OK.

Yes, that would be appreciated.

> > If the grouping of assembler instruction is supposed to reflect SH4
> > scheduling, you are missing a few details:
>
> We are aware of the scheduling restrictions.  And we have taken care of
> these rules of grouping instructions, wherever algorithm allows.

Then what is the meaning of the blank lines interspersed with the instructions?

> > Yes, what you are missing is that normalization can be sped up by using
> > a lookup table.  libgcc2.c provides and uses __clz_tab for this purpose.
>
> I see that it can be done.  But I have concerns about the dependence of
> floating point implementations over other things.

I think it is good practice to use other library parts where it doesn't
slow you down materially, and saves on code size.
The fp-bit.c code uses functions from lib1funcs.asm to do variable shifts
for SH1 / SH2, too.  In fact, I think using them for variable shifts from
the SH1 / SH2 asm implementation is also a good idea where there is no faster
fixed-shift solution available.  E.g. when you add two numbers with different
exponents, you really have to do a dynamic shift.
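
As a rough illustration of table-driven normalization, here is a C sketch.
It builds its own 256-entry table rather than using libgcc's __clz_tab, so
bit_length_tab, init_bit_length_tab, clz32 and normalize32 are all
hypothetical names, and the table layout (number of significant bits of each
byte value) is just an assumption made for this example.

```c
#include <stdint.h>

/* bit_length_tab[i] = number of significant bits in i (0 for i == 0). */
static unsigned char bit_length_tab[256];

/* Must be called once before clz32 / normalize32. */
static void init_bit_length_tab(void)
{
    bit_length_tab[0] = 0;
    for (int i = 1; i < 256; i++)
        bit_length_tab[i] = (unsigned char)(bit_length_tab[i >> 1] + 1);
}

/* Count leading zeros of a non-zero 32 bit value: pick the topmost
   non-zero byte with at most two compares, then do one table lookup.  */
static int clz32(uint32_t x)
{
    int shift = x >= (UINT32_C(1) << 16)
                ? (x >= (UINT32_C(1) << 24) ? 24 : 16)
                : (x >= (UINT32_C(1) << 8) ? 8 : 0);
    return 32 - (bit_length_tab[x >> shift] + shift);
}

/* Shift a non-zero fraction so that bit 31 is set, and adjust the
   exponent to match.  No loop over single-bit shifts is needed.  */
static uint32_t normalize32(uint32_t frac, int *exponent)
{
    int n = clz32(frac);
    *exponent -= n;
    return frac << n;
}
```

The final shift by n is exactly the kind of variable shift discussed above.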

> You have written SET patterns for FP arithmetic functions in md file.
> This kind of dependence might make the code unmaintainable.  I'd prefer
> the modularity in code.

These functions are supposed to do a specific job on a specific processor
architecture.  Neither is going to change any time soon.
If you want to support different rounding modes, I think that would best be
accomplished by having different functions for the various rounding modes,
and have the caller use a jump table (or base address) if the rounding mode
is not known at compile time.  This avoids penalizing the common case that
only round-to-nearest is wanted, keeps considerations of the per-thread
rounding mode in the caller, and keeps the definition of our current
functions the same.
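
A small C sketch of the caller-side table idea follows.  It only rounds an
integer fraction with 8 guard bits, to keep the example short; the real
entries would be the per-mode arithmetic routines.  All names here
(round_fn, RM_*, round_table and the four functions) are invented for
illustration and are not existing libgcc symbols.

```c
#include <stdint.h>

#define GUARD_BITS 8
#define GUARD_MASK ((UINT32_C(1) << GUARD_BITS) - 1)
#define GUARD_HALF (UINT32_C(1) << (GUARD_BITS - 1))

/* Each routine drops the guard bits of frac; negative is the sign. */
typedef uint32_t (*round_fn)(uint32_t frac, int negative);

static uint32_t round_nearest_even(uint32_t frac, int negative)
{
    uint32_t q = frac >> GUARD_BITS, rem = frac & GUARD_MASK;
    (void)negative;
    if (rem > GUARD_HALF || (rem == GUARD_HALF && (q & 1)))
        q++;
    return q;
}

static uint32_t round_toward_zero(uint32_t frac, int negative)
{
    (void)negative;
    return frac >> GUARD_BITS;
}

static uint32_t round_up(uint32_t frac, int negative)
{
    /* Toward +infinity: magnitude grows only for positive values. */
    return (frac >> GUARD_BITS) + ((frac & GUARD_MASK) != 0 && !negative);
}

static uint32_t round_down(uint32_t frac, int negative)
{
    /* Toward -infinity: magnitude grows only for negative values. */
    return (frac >> GUARD_BITS) + ((frac & GUARD_MASK) != 0 && negative);
}

enum { RM_NEAREST, RM_ZERO, RM_UP, RM_DOWN };

/* The caller indexes this with the current rounding mode when the mode
   is not known at compile time; otherwise it calls one entry directly. */
static const round_fn round_table[] = {
    round_nearest_even, round_toward_zero, round_up, round_down
};
```

Code that is known to want round-to-nearest calls round_nearest_even
directly and never pays for the indirection.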

On the other hand, if the ABI is changed (i.e. a different set of
callee-saved registers), code written in C can just be recompiled, but
the asm code stays the same.  Having md patterns that describe the
register usage exactly makes these functions safe to use.

> We have also sent you the DP implementation.  I don't think defining md
> patterns for DFmode functions would give significant savings.  The
> register usage will be enough for them.

For patterns like muldf3 / adddf3, you are probably right, they are likely
to want a lot of registers up-front so that reducing general purpose register
usage would make them slower.
However, there are other factors that make having md patterns for these
functions desirable:

- The -mrenesas option changes the ABI to make macl and mach callee-saved.
  Without md patterns to tell it otherwise, the compiler will assume that
  your mulsf3 / muldf3 implementations save these registers.  For most of
  the targets, you could say that this is an old bug because fp-bit is
  not multilibbed on -mrenesas, but the Symbian port (Hi Nick!) sets the
  renesas bit in TARGET_DEFAULT, hence compiling fp-bit will do the right
  thing.
- The rtl optimizers work much better when they see add:DF or mul:DF
  than call:DF.
  
For compare / negate / extend / truncate, the savings in register usage
are still significant.  In fact, having a library function for
negsf3 / negdf3 doesn't make any sense in itself; we'll only need it for
backwards compatibility.