This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: RFC: Handling of libgcc symbols in SH shared libraries


> ion. paranoia.c passed with excellent results. However, we couldn't do the
> benchmarking because of some hardware problems at our end. That's why we are
> delayed in posting the patch. I'm attaching reports from paranoia.c for b

I see what you mean with needing to benchmark the patch.  You have a number
of loops that slow down the code to the point that I can't tell off-hand
if it will be faster than the fp-bit.c implementation.
Just as a rough guess, your addsf3 and mulsf3 implementations might take
some four times as long as mine on an SH4.

E.g. this code:

+.L_epil_0:
+       mov     #-23,r3
+       shll    r5
+       mov     #0,r6
+
+! Fit resultant mantissa in 24 bits
+! Apply default rounding
+.L_loop_epil_0:
+        tst    r3,r3
+       bt      .L_loop_epil_out
+
+       add     #1,r3
+       shlr    r4
+
+       bra     .L_loop_epil_0
+       rotcr   r6
+
+! Round mantissa
+.L_loop_epil_out:

takes 2 + 4 * 23 + 3, i.e. 97 cycles on the SH4.

Considering that r3 and the T bit are dead at .L_loop_epil_out, equivalent
code is:

	shll	r5
	mov.w	LOCAL(m23),r3
	mov	r4,r6
	shll8	r6
	add	r6,r6
	shad	r3,r4
...
LOCAL(m23):
	.word -23

Which takes just four cycles on the SH4.
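
To make the equivalence easier to check, here is a rough C model of both
sequences. It is only a sketch: the function names are made up for
illustration, the registers are modelled as unsigned 32-bit values, and it
assumes the top bit of r4 is clear, so that the arithmetic shift done by
shad gives the same result as a logical shift.

```c
#include <assert.h>
#include <stdint.h>

/* Models the quoted loop: 23 iterations of "shlr r4; rotcr r6".
   shlr moves bit 0 of r4 into T; rotcr shifts r6 right by one
   and inserts T at bit 31.  (Hypothetical helper name.) */
void epilogue_loop(uint32_t *r4, uint32_t *r6)
{
    *r6 = 0;                               /* mov #0,r6 */
    for (int r3 = -23; r3 != 0; r3++) {    /* tst r3,r3 / bt / add #1,r3 */
        uint32_t t = *r4 & 1u;             /* shlr r4: T = old bit 0 */
        *r4 >>= 1;
        *r6 = (*r6 >> 1) | (t << 31);      /* rotcr r6: T enters at bit 31 */
    }
}

/* Models the straight-line replacement.  (Hypothetical helper name.) */
void epilogue_straight(uint32_t *r4, uint32_t *r6)
{
    /* Assumption of this sketch: top bit of r4 is clear, so an
       arithmetic right shift equals a logical one. */
    assert((*r4 >> 31) == 0);
    *r6 = *r4;      /* mov   r4,r6 */
    *r6 <<= 8;      /* shll8 r6 */
    *r6 += *r6;     /* add   r6,r6  (one more left shift, 9 in total) */
    *r4 >>= 23;     /* shad  r3,r4 with r3 = -23 */
}
```

In both versions r4 ends up shifted right by 23 bits, and r6 holds the 23
bits that were shifted out, left-aligned (i.e. the original r4 shifted left
by 9).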


If the grouping of assembler instructions is supposed to reflect SH4
scheduling, you are missing a few details:

- tst and cmp have a latency of 1, hence they can't be paired with
  a dependent branch.  Likewise, EX instructions have a latency of 1,
  hence they can't be paired with a compare that uses their result.
- You can't pair two EX instructions, nor can you pair two BR instructions.
- Loads have a latency of two.
- CO instructions can't be paired.
- A pair consists of at most two instructions.  When a branch with delay
  slot is paired with a preceding instruction, the delay-slot instruction
  can be paired with a following instruction on the fall-through path.

> dn't cause any problem with fp-bit.c to the best of my knowledge. PIC code
> consideration is not required since we have taken all the branches to be
> relative (Am I missing something?)
>
>   We had considered [fd]p-bit functio

Yes, what you are missing is that normalization can be sped up by using
a lookup table.  libgcc2.c provides and uses __clz_tab for this purpose.
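
As a rough illustration of table-driven normalization, here is a C sketch.
It does not use libgcc's actual symbol: bits_tab is a made-up stand-in
that is filled in at run time, on the assumption that entry i holds the
number of significant bits in i.  The function names are likewise
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a __clz_tab-style table:
   bits_tab[i] = number of significant bits in i (0 for i == 0). */
static uint8_t bits_tab[256];

void init_bits_tab(void)
{
    bits_tab[0] = 0;
    for (int i = 1; i < 256; i++)
        bits_tab[i] = (uint8_t)(bits_tab[i >> 1] + 1);
}

/* Count leading zeros of a nonzero 32-bit value: pick the byte that
   holds the highest set bit, then do a single table lookup. */
int clz32_table(uint32_t x)
{
    assert(x != 0);
    int shift = x < 0x10000u ? (x < 0x100u ? 0 : 8)
                             : (x < 0x1000000u ? 16 : 24);
    return 32 - (bits_tab[x >> shift] + shift);
}

/* Normalize a nonzero mantissa so that its highest set bit lands on
   bit 31; the shift count is returned through shift_out so that the
   caller can adjust the exponent. */
uint32_t normalize_left(uint32_t mant, int *shift_out)
{
    int n = clz32_table(mant);
    *shift_out = n;
    return mant << n;
}
```

The point is that the shift count comes from a couple of compares and one
load instead of a bit-at-a-time loop.  The table is a data reference rather
than a relative branch, so it is where PIC comes into play.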

