[Bug middle-end/78809] Inline strcmp with small constant strings
qing.zhao at oracle dot com
gcc-bugzilla@gcc.gnu.org
Tue Oct 24 14:59:00 GMT 2017
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809
--- Comment #10 from Qing Zhao <qing.zhao at oracle dot com> ---
>> From the data, we can see the inlined version of strcmp (by glibc) is much
>> slower than the direct call to strcmp. (this is for size 2)
>> I am using GCC farm machine gcc116:
>
> This result doesn't make sense - it looks like GCC is moving the strcmp call in
> the 2nd case as a loop invariant, so you're just measuring a loop with just a
> subtract and orr instruction…
Yes, Wilco is right here. -ftree-loop-im moves the call to strcmp out of the
loop.
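A rough C sketch of the effect, written by hand for illustration. The real t_p.c is not shown in this comment, so the function names, the iters parameter and the constant "f" are assumptions. Since strcmp has no side effects and its arguments do not change inside the loop, the call can be done once up front, which leaves only the OR and the counter update in the loop body:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names; the benchmark source is not part of this comment. */

/* Loop as written in the source: strcmp is called on every iteration. */
static int loop_as_written(const char *s, long iters)
{
    assert(s != NULL);
    int acc = 0;
    for (long i = 0; i < iters; i++)
        acc |= strcmp(s, "f");
    return acc;
}

/* The same loop after hoisting the invariant call by hand. */
static int loop_after_hoisting(const char *s, long iters)
{
    assert(s != NULL);
    int acc = 0;
    int r = strcmp(s, "f");     /* loop invariant: computed once */
    for (long i = 0; i < iters; i++)
        acc |= r;               /* only the OR remains in the loop */
    return acc;
}
```

Timing the second form says nothing about the cost of strcmp itself, which is why the earlier numbers were misleading.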
To avoid this issue, I changed the options to
-O -fno-tree-loop-im
and checked the assembly of the routine “cmp2” for the INLINED and non-INLINED
versions.
Inlined version:
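cmp2:
        mov     x4, x0                  // x4 = pointer to the string argument
        mov     w2, 51712
        movk    w2, 0x3b9a, lsl 16      // w2 = 1000000000 (0x3b9aca00)
        mov     w0, 0                   // w0 = accumulated result
        mov     w3, 102                 // w3 = 'f'
        b       .L3
.L2:
        neg     w1, w1
        orr     w0, w0, w1              // OR this iteration's result into w0
        subs    w2, w2, #1
        beq     .L5
.L3:
        ldrb    w1, [x4]                // first byte of the argument
        subs    w1, w3, w1              // 'f' - first byte
        bne     .L2                     // bytes differ: comparison is decided
        ldrb    w1, [x4, 1]             // second byte of the argument
        neg     w1, w1
        b       .L2
.L5:
        ret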
Non-inlined version:
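cmp2:
        stp     x29, x30, [sp, -48]!    // save frame pointer and link register
        add     x29, sp, 0
        stp     x19, x20, [sp, 16]      // save callee-saved registers
        stp     x21, x22, [sp, 32]
        mov     x22, x0                 // x22 = pointer to the string argument
        mov     w19, 51712
        movk    w19, 0x3b9a, lsl 16     // w19 = 1000000000 (0x3b9aca00)
        mov     w20, 0                  // w20 = accumulated result
        adrp    x21, .LC0
        add     x21, x21, :lo12:.LC0    // x21 = address of the constant string
.L2:
        mov     x1, x21
        mov     x0, x22
        bl      strcmp                  // library call on every iteration
        orr     w20, w20, w0            // OR the result into w20
        subs    w19, w19, #1
        bne     .L2
        mov     w0, w20
        ldp     x19, x20, [sp, 16]
        ldp     x21, x22, [sp, 32]
        ldp     x29, x30, [sp], 48
        ret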
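For readers who prefer C to assembly, the loop body of the inlined listing can be approximated as below. This is a hand-written sketch, not the compiler's or glibc's actual expansion. The constant is assumed to be "f", inferred from the immediate 102 and from the second byte being used without a subtraction, and all function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Sign of an int: -1, 0 or 1. strcmp only guarantees the sign of its result. */
static int sign_of(int x)
{
    return (x > 0) - (x < 0);
}

/* Hand-written equivalent of the inlined comparison against "f". */
static int strcmp_f_expanded(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    int diff = p[0] - 'f';      /* first byte against 102 ('f') */
    if (diff == 0)
        diff = p[1];            /* second byte against the NUL of "f" */
    return diff;
}

/* Hypothetical helper: the expansion must agree in sign with the library. */
static void check_against_library(const char *s)
{
    assert(sign_of(strcmp_f_expanded(s)) == sign_of(strcmp(s, "f")));
}
```

The second byte is only read when the first byte equals 'f', so the expansion never reads past the terminating NUL of the argument.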
Then, the run-time performance data is:
qinzhao@gcc116:~/Bugs/78809/const_cmp/perf$ sh t_p
/home/qinzhao/Install/latest/bin/gcc -O -fno-tree-loop-im t_p_1.c t_p.c -DINLINED
inlined version
34.73user 0.00system 0:34.73elapsed 99%CPU (0avgtext+0avgdata 360maxresident)k
0inputs+0outputs (0major+135minor)pagefaults 0swaps
/home/qinzhao/Install/latest/bin/gcc -O -fno-tree-loop-im t_p_1.c t_p.c
non-inlined version
138.79user 0.00system 2:18.77elapsed 100%CPU (0avgtext+0avgdata 356maxresident)k
0inputs+0outputs (0major+135minor)pagefaults 0swaps
Yes, it looks like the inlined version is much faster than the non-inlined
version on the aarch64 platform: 34.73s versus 138.79s of user time, roughly
4x.