This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: RFC: Handling of libgcc symbols in SH shared libraries
- From: Joern Rennecke <joern dot rennecke at superh dot com>
- To: kumar107 at rediffmail dot com
- Cc: joern dot rennecke at superh dot com (Joern Rennecke), gcc at gcc dot gnu dot org
- Date: Wed, 4 Aug 2004 20:53:35 +0100 (BST)
> ion. paranoia.c passed with excellent results. However, we couldn't do the
> benchmarking because of some hardware problems at our end. That's why we are
> delayed in posting the patch. I'm attaching reports from paranoia.c for b
I see what you mean about needing to benchmark the patch. You have a number
of loops that slow down the code to the point that I can't tell offhand
whether it will be faster than the fp-bit.c implementation.
Just as a rough guess, your addsf3 and mulsf3 implementations might take
some four times as long as mine on an SH4.
E.g. this code:
+.L_epil_0:
+ mov #-23,r3
+ shll r5
+ mov #0,r6
+
+! Fit resultant mantissa in 24 bits
+! Apply default rounding
+.L_loop_epil_0:
+ tst r3,r3
+ bt .L_loop_epil_out
+
+ add #1,r3
+ shlr r4
+
+ bra .L_loop_epil_0
+ rotcr r6
+
+! Round mantissa
+.L_loop_epil_out:
takes 2 + 4 * 23 + 3, i.e. 97 cycles on the SH4.
Considering that r3 and the T bit are dead at .L_loop_epil_out, equivalent
code is:
shll r5
mov.w LOCAL(m23),r3
mov r4,r6
shll8 r6
add r6,r6
shad r3,r4
...
LOCAL(m23):
.word -23
Which takes just four cycles on the SH4.
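
To make the equivalence easier to check, here is a rough C model of both
sequences. It is only a sketch: the function names are made up for this
illustration, registers are modelled as 32-bit unsigned values, and it assumes
bit 31 of r4 is clear on entry, so that the arithmetic shift done by shad gives
the same result as the logical shlr in the loop.

```c
#include <stdint.h>

/* Models the quoted loop: r6 starts at 0, then 23 times "shlr r4; rotcr r6". */
static void shift_by_loop(uint32_t *r4, uint32_t *r6)
{
    *r6 = 0;
    for (int i = 0; i < 23; i++) {
        uint32_t t = *r4 & 1u;         /* shlr: bit 0 of r4 goes to T */
        *r4 >>= 1;
        *r6 = (*r6 >> 1) | (t << 31);  /* rotcr: T enters bit 31 of r6 */
    }
}

/* Models the replacement sequence. */
static void shift_direct(uint32_t *r4, uint32_t *r6)
{
    *r6 = *r4 << 8;  /* mov r4,r6 ; shll8 r6 */
    *r6 += *r6;      /* add r6,r6, i.e. r6 = r4 << 9 overall */
    *r4 >>= 23;      /* shad r3,r4 with r3 = -23 (assumes bit 31 is clear) */
}

/* Hypothetical helper: returns 1 if both models agree for input x. */
static int shifts_agree(uint32_t x)
{
    uint32_t a4 = x, a6, b4 = x, b6;
    shift_by_loop(&a4, &a6);
    shift_direct(&b4, &b6);
    return a4 == b4 && a6 == b6;
}
```

In both versions r4 ends up shifted right by 23 and r6 holds the 23 bits that
were shifted out, left-aligned at bit 31, which is what the rounding step
needs.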
If the grouping of assembler instructions is supposed to reflect SH4
scheduling, you are missing a few details:
- tst and cmp have a latency of 1, hence they can't be paired with
a dependent branch. Likewise, EX instructions have a latency of 1,
hence they can't be paired with a compare that uses their result.
- You can't pair two EX instructions, nor can you pair two BR instructions.
- Loads have a latency of two.
- CO instructions can't be paired.
- Only two instructions each can form a pair. When a branch with delay
slot is paired with a preceding instruction, the delay-slot instruction
can be paired with a following instruction on the fall-through path.
> dn't cause any problem with fp-bit.c to the best of my knowledge. PIC code
> consideration is not required since we have taken all the branches to be
> relative (Am I missing something?)
>
> We had considered [fd]p-bit functio
Yes, what you are missing is that normalization can be sped up by using
a lookup table. libgcc2.c provides and uses __clz_tab for this purpose.
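
A minimal sketch of table-driven normalization, in C for readability. The
table and function names below are hypothetical stand-ins, not the libgcc
definitions; the sketch assumes a 256-entry table that gives the number of
significant bits in a byte, which is the general idea behind a table like
__clz_tab. It is filled at run time here only to keep the example short.

```c
#include <stdint.h>

/* Hypothetical stand-in table: bit_len_tab[b] = significant bits in byte b. */
static unsigned char bit_len_tab[256];

static void init_bit_len_tab(void)
{
    bit_len_tab[0] = 0;
    for (int i = 1; i < 256; i++)
        bit_len_tab[i] = (unsigned char)(bit_len_tab[i >> 1] + 1);
}

/* Leading zero count of a nonzero 32-bit value: pick the highest
   nonzero byte with at most two compares, then do one table lookup. */
static int clz32_by_table(uint32_t x)
{
    int shift = x >= (1u << 16) ? (x >= (1u << 24) ? 24 : 16)
                                : (x >= (1u << 8) ? 8 : 0);
    return 32 - (bit_len_tab[x >> shift] + shift);
}

/* Shift the mantissa left until bit 31 is set and return the shift count.
   A zero mantissa is left untouched and reported as a shift of 0. */
static int normalize32(uint32_t *mant)
{
    if (*mant == 0)
        return 0;
    int n = clz32_by_table(*mant);
    *mant <<= n;
    return n;
}
```

The point is that the shift count comes from a couple of compares and one
load, rather than from a loop that shifts one bit per iteration.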