[Bug middle-end/78809] Inline strcmp with small constant strings

Tue Oct 24 14:59:00 GMT 2017

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809

--- Comment #10 from Qing Zhao <qing.zhao at oracle dot com> ---
>> From the data, we can see the inlined version of strcmp (by glibc) is much
>> slower than the direct call to strcmp.  (this is for size 2)
>> I am using GCC farm machine gcc116:
> 
> This result doesn't make sense - it looks like GCC is moving the strcmp call in
> the 2nd case as a loop invariant, so you're just measuring a loop with just a
> subtract and orr instruction…

Yes, Wilco is right here.  -ftree-loop-im moves the call to strcmp out of the
loop.
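
To illustrate the hoisting, here is a minimal sketch in C. The real t_p.c is not shown in this comment, so the function names, the iteration parameter `n`, and the constant string "f" are assumptions (the string is inferred from the value 102 in the assembly further down). Because GCC knows strcmp has no side effects and both arguments are loop-invariant, the first function can effectively be turned into the second:

```c
#include <string.h>

/* Hypothetical benchmark loop as written: strcmp is called on every
   iteration, so the loop measures the cost of the comparison. */
int cmp2_in_loop(const char *s, long n)
{
    int result = 0;
    for (long i = 0; i < n; i++)
        result |= strcmp(s, "f");
    return result;
}

/* What the loop effectively becomes after loop-invariant motion:
   the call is evaluated once and the loop only does the OR and the
   counter update, so timing it says nothing about strcmp. */
int cmp2_hoisted(const char *s, long n)
{
    int result = 0;
    int r = strcmp(s, "f");   /* computed once, outside the loop */
    for (long i = 0; i < n; i++)
        result |= r;
    return result;
}
```

Both functions return the same value for n >= 1; only the amount of work per iteration differs.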
To avoid this, I changed the options to

-O -fno-tree-loop-im

and checked the assembly of the routine “cmp2” for the INLINED and non-INLINED
versions.

Inlined version:
cmp2:
        mov     x4, x0
        mov     w2, 51712
        movk    w2, 0x3b9a, lsl 16
        mov     w0, 0
        mov     w3, 102
        b       .L3
.L2:
        neg     w1, w1
        orr     w0, w0, w1
        subs    w2, w2, #1
        beq     .L5
.L3:
        ldrb    w1, [x4]
        subs    w1, w3, w1
        bne     .L2
        ldrb    w1, [x4, 1]
        neg     w1, w1
        b       .L2
.L5:
        ret

Non-inlined version:
cmp2:
        stp     x29, x30, [sp, -48]!
        add     x29, sp, 0
        stp     x19, x20, [sp, 16]
        stp     x21, x22, [sp, 32]
        mov     x22, x0
        mov     w19, 51712
        movk    w19, 0x3b9a, lsl 16
        mov     w20, 0
        adrp    x21, .LC0
        add     x21, x21, :lo12:.LC0
.L2:
        mov     x1, x21
        mov     x0, x22
        bl      strcmp
        orr     w20, w20, w0
        subs    w19, w19, #1
        bne     .L2
        mov     w0, w20
        ldp     x19, x20, [sp, 16]
        ldp     x21, x22, [sp, 32]
        ldp     x29, x30, [sp], 48
        ret
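
For readers who prefer C to assembly, the two listings correspond roughly to the sketch below. This is a reconstruction from the assembly, not the actual t_p.c or the glibc macro: the function names and the parameter `n` are hypothetical, the iteration count comes from the mov/movk pair (0x3b9aca00 = 1000000000), and the constant string "f" comes from the value 102 loaded into w3.

```c
#include <string.h>

/* Iteration count seen in both listings: 0x3b9aca00. */
#define ITERATIONS 1000000000L

/* Non-inlined form: one "bl strcmp" per iteration. */
int cmp2_call(const char *s, long n)
{
    int result = 0;
    for (long i = 0; i < n; i++)
        result |= strcmp(s, "f");
    return result;
}

/* Inlined form: the comparison against "f" is expanded into at most
   two byte loads and a subtraction per iteration. */
int cmp2_inlined(const char *s, long n)
{
    const unsigned char *p = (const unsigned char *)s;
    int result = 0;
    for (long i = 0; i < n; i++) {
        int diff = p[0] - 'f';   /* compare the first byte with 'f' */
        if (diff == 0)
            diff = p[1];         /* compare the second byte with '\0' */
        result |= diff;
    }
    return result;
}
```

Under these assumptions, the benchmark would call `cmp2_call(s, ITERATIONS)` or `cmp2_inlined(s, ITERATIONS)`. The inlined loop avoids the call overhead and the register saves and restores visible in the non-inlined prologue and epilogue.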

Then, the run-time performance data is:

qinzhao@gcc116:~/Bugs/78809/const_cmp/perf$ sh t_p
/home/qinzhao/Install/latest/bin/gcc -O -fno-tree-loop-im t_p_1.c t_p.c -DINLINED
inlined version
34.73user 0.00system 0:34.73elapsed 99%CPU (0avgtext+0avgdata 360maxresident)k
0inputs+0outputs (0major+135minor)pagefaults 0swaps
/home/qinzhao/Install/latest/bin/gcc -O -fno-tree-loop-im t_p_1.c t_p.c
non-inlined version
138.79user 0.00system 2:18.77elapsed 100%CPU (0avgtext+0avgdata 356maxresident)k
0inputs+0outputs (0major+135minor)pagefaults 0swaps

So with loop-invariant motion disabled, the inlined version is much faster than
the non-inlined version on aarch64: 34.73s versus 138.79s of user time, roughly
4x.