This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATH, SH] Small builtin_strlen improvement
- From: Oleg Endo <oleg dot endo at t-online dot de>
- To: Christian Bruel <christian dot bruel at st dot com>
- Cc: "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>, Kaz Kojima <kkojima at rr dot iij4u dot or dot jp>
- Date: Sun, 30 Mar 2014 23:02:32 +0200
- Subject: Re: [PATH, SH] Small builtin_strlen improvement
- Authentication-results: sourceware.org; auth=none
- References: <533288C1 dot 1080306 at st dot com>
Hi,
On Wed, 2014-03-26 at 08:58 +0100, Christian Bruel wrote:
> This patch adds a few instructions to the inlined builtin_strlen to
> unroll the remaining bytes after the word-at-a-time loop. This gives
> 2 distinct execution paths (no fall-through into the byte-at-a-time loop),
> which allows block alignment to be assigned. This partially addresses the
> problem reported by Oleg in [Bug target/0539] New: [SH] builtin
> string functions ignore loop and label alignment
Actually, my original concern was the (mis)alignment of the 4 byte inner
loop. AFAIR it's better for the SH pipeline if the first insn of a loop
is 4 byte aligned.
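For readers following the thread, the two execution paths described in the quoted text can be sketched in C. This is only an illustration under stated assumptions, not the GCC expander code: `sketch_strlen` and `has_zero_byte` are hypothetical names, the bit trick stands in for the SH `cmp/str` instruction compared against zero, and the buffer is assumed to be readable up to the next 4-byte boundary, as the aligned word loads in the generated code also assume.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for "cmp/str" against a zero register:
   nonzero if any of the four bytes of w is 0. */
static int has_zero_byte(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}

/* Hypothetical sketch of the inlined strlen shape discussed here. */
size_t sketch_strlen(const char *s)
{
    const char *p = s;

    if (((uintptr_t)p & 3) != 0) {
        /* Unaligned start: separate byte-at-a-time loop. */
        while (*p != '\0')
            p++;
        return (size_t)(p - s);
    }

    /* Aligned start: word-at-a-time loop. */
    for (;;) {
        uint32_t w;
        memcpy(&w, p, sizeof w);    /* aligned 4-byte load */
        if (has_zero_byte(w))
            break;
        p += 4;
    }

    /* Unrolled tail: the terminator is one of the 4 bytes at p,
       so no fall-through into the byte loop is needed. */
    if (p[0] == '\0')
        return (size_t)(p - s);
    if (p[1] == '\0')
        return (size_t)(p + 1 - s);
    p += 2;
    p += (*p != '\0');              /* last two candidates, branch-free */
    return (size_t)(p - s);
}
```

Because the tail is unrolled, the word loop and the byte loop never share code, which is what lets each loop head be aligned independently.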
>
> whereas the test now expands (-O2 -m4) as
> 	mov	r4,r0
> 	tst	#3,r0
> 	mov	r4,r2
> 	bf/s	.L12
> 	mov	r4,r3
> 	mov	#0,r2
> .L4:
> 	mov.l	@r4+,r1
> 	cmp/str	r2,r1
> 	bf	.L4
> 	add	#-4,r4
> 	mov.b	@r4,r1
> 	tst	r1,r1
> 	bt	.L2
> 	add	#1,r4
> 	mov.b	@r4,r1
> 	tst	r1,r1
> 	bt	.L2
> 	add	#1,r4
> 	mov.b	@r4,r1
> 	tst	r1,r1
> 	mov	#-1,r1
> 	negc	r1,r1
> 	add	r1,r4
> .L2:
> 	mov	r4,r0
> 	rts
> 	sub	r3,r0
> 	.align 1
> .L12:
> 	mov.b	@r4+,r1
> 	tst	r1,r1
> 	bf/s	.L12
> 	mov	r2,r3
> 	add	#1,r3
> 	mov	r4,r0
> 	rts
> 	sub	r3,r0
>
>
> The best gain I measured over the "compact" version is ~1% on a C++
> regular expression benchmark, but well, the code looks best this way.
I haven't done any measurements but doesn't this introduce some
performance regressions here and there due to the increased code size?
Maybe the byte unrolling should not be done at -O2 but at -O3?
Moreover, post-inc addressing on the bytes could be used. Ideally we'd
get something like this:
	mov	r4,r0
	tst	#3,r0
	bf/s	.L12
	mov	r4,r3
	mov	#0,r2
.L4:
	mov.l	@r4+,r1
	cmp/str	r2,r1
	bf	.L4
	add	#-4,r4
	mov.b	@r4+,r1
	tst	r1,r1
	bt	.L2
	mov.b	@r4+,r1
	tst	r1,r1
	bt	.L2
	mov.b	@r4+,r1
	tst	r1,r1
	mov	#-1,r1
	subc	r1,r4
	sett
.L2:
	mov	r4,r0
	rts
	subc	r3,r0
	.align 1
.L12:
	mov.b	@r4+,r1
	tst	r1,r1
	bf	.L12
	mov	r4,r0
	rts
	subc	r3,r0
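To make the carry arithmetic in the two listings easier to check, here is a small C model of the SH T bit together with `subc` and `negc`. It is a sketch for illustration only: the helper names (`sh_subc`, `sh_negc`, `tail_negc`, `tail_subc`, `len_subc`) are hypothetical, and registers are modelled as plain 32-bit values.

```c
#include <stdint.h>

/* subc Rm,Rn: Rn = Rn - Rm - T, and T receives the borrow. */
static uint32_t sh_subc(uint32_t rn, uint32_t rm, unsigned *t)
{
    uint64_t wide = (uint64_t)rn - rm - *t;
    *t = (unsigned)((wide >> 32) & 1);  /* bit 32 is set on borrow */
    return (uint32_t)wide;
}

/* negc Rm,Rn: Rn = 0 - Rm - T, and T receives the borrow. */
static uint32_t sh_negc(uint32_t rm, unsigned *t)
{
    return sh_subc(0, rm, t);
}

/* Tail of the first listing:
   tst r1,r1 ; mov #-1,r1 ; negc r1,r1 ; add r1,r4 */
uint32_t tail_negc(uint32_t r4, uint8_t byte)
{
    unsigned t = (byte == 0);                  /* tst r1,r1 */
    uint32_t r1 = sh_negc(0xFFFFFFFFu, &t);    /* r1 = 1 - T */
    return r4 + r1;                            /* add r1,r4 */
}

/* Tail of the second listing:
   tst r1,r1 ; mov #-1,r1 ; subc r1,r4 */
uint32_t tail_subc(uint32_t r4, uint8_t byte)
{
    unsigned t = (byte == 0);                  /* tst r1,r1 */
    return sh_subc(r4, 0xFFFFFFFFu, &t);       /* r4 = r4 + 1 - T */
}

/* Return sequence of the second listing: rts ; subc r3,r0.
   With T = 1 this yields r4 - r3 - 1, which compensates for the
   post-increment having stepped past the terminating zero byte. */
uint32_t len_subc(uint32_t r4, uint32_t r3, unsigned t)
{
    return sh_subc(r4, r3, &t);
}
```

In this model both tails advance r4 by one only when the tested byte is nonzero; the `subc` form does so without the separate `add`.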
I'll have a look at the missed 'subc' cases.
Cheers,
Oleg