[Bug rtl-optimization/70164] New: Code/performance regression due to poor register allocation on Cortex-M0

Thu Mar 10 10:58:00 GMT 2016

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70164

            Bug ID: 70164
           Summary: Code/performance regression due to poor register
                    allocation on Cortex-M0
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: andre.simoesdiasvieira at arm dot com
  Target Milestone: ---

Created attachment 37920
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37920&action=edit
current ira dump

After a quick investigation of the testcase
gcc/testsuite/gcc.target/arm/pr45701-1.c for cortex-m0 on trunk, I found that
the test case fails because of a change in register allocation after revision
r226901.

Before this revision, the register allocator chose to load the global
'hist_verify' into r6, representing 'old_verify', prior to the function call
to pre_process_line. 'old_verify' is used after the function call, and r6 is
callee-saved, so the value survived the call without extra code. With the
patch it is loaded into r3, a caller-saved register, which means it has to be
spilled before the function call and reloaded after.
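
For readers without the testcase at hand, the pattern in question can be
sketched in C roughly as follows. This is a reconstruction from the assembly
listings in this report, not a verbatim copy of pr45701-1.c: the name
'other_global', the initial values, and the bodies of the four helper
functions are assumptions added only so that the sketch compiles and runs on a
host machine.

```c
/* Sketch reconstructed from the assembly in this report.  It is NOT the
   testsuite file: 'other_global' and all stub bodies are assumptions.  */
#include <stdlib.h>
#include <string.h>

int hist_verify = 5;
int other_global = 2;            /* hypothetical name for the first global */

/* Host-side stubs standing in for the external functions in the listing. */
char *pre_process_line (char *line) { return line; }
int str_len (char *s) { return (int) strlen (s); }
char *x_malloc (int n) { return (char *) malloc ((size_t) n); }
char *str_cpy (char *dst, char *src) { return strcpy (dst, src); }

char *
history_expand_line_internal (char *line)
{
  /* Both globals are read before the call, as in the listings.  */
  int addend = other_global;
  int old_verify = hist_verify;   /* live across the call below */
  char *new_line;

  hist_verify = 0;
  new_line = pre_process_line (line);
  hist_verify = addend + old_verify;  /* use of old_verify after the call */

  if (new_line == line)
    return str_cpy (x_malloc (str_len (line) + 1), line);
  return new_line;
}
```

Any value that is read before the call and used after it, such as old_verify
here, is a candidate for a callee-saved register; placing it in a caller-saved
one forces a store and a load around the call.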

Before patch:
history_expand_line_internal:
        push    {r3, r4, r5, r6, r7, lr}
        ldr     r3, .L5
        ldr     r5, .L5+4
        ldr     r4, [r3]
        movs    r3, #0
        ldr     r6, [r5]       ; <--- load of 'hist_verify' onto r6
        movs    r7, r0
        str     r3, [r5]
        bl      pre_process_line
        adds    r6, r4, r6
        str     r6, [r5]
        movs    r4, r0
        cmp     r7, r0
        bne     .L2
        bl      str_len
        adds    r0, r0, #1
        bl      x_malloc
        movs    r1, r4
        bl      str_cpy
        movs    r4, r0
.L2:
        movs    r0, r4
        @ sp needed
        pop     {r3, r4, r5, r6, r7, pc}

Current:
history_expand_line_internal:
        push    {r0, r1, r2, r4, r5, r6, r7, lr}
        ldr     r3, .L3
        ldr     r5, .L3+4
        ldr     r6, [r3]
        ldr     r3, [r5]        ; <--- load of 'hist_verify' onto r3
        movs    r7, r0
        str     r3, [sp, #4]    ; <--- Spill
        movs    r3, #0
        str     r3, [r5]
        bl      pre_process_line
        ldr     r3, [sp, #4]    ; <--- Reload
        movs    r4, r0
        adds    r6, r6, r3
        str     r6, [r5]
        cmp     r7, r0
        bne     .L1
        bl      str_len
        adds    r0, r0, #1
        bl      x_malloc
        movs    r1, r4
        bl      str_cpy
        movs    r4, r0
.L1:
        movs    r0, r4
        @ sp needed
        pop     {r1, r2, r3, r4, r5, r6, r7, pc}

I have also attached the dumps for ira and reload for both pre-patch and
current. In the current reload dump, insn 9 represents the load into r3 and
insn 62 the spill. In the pre-patch ira/reload dumps, the load is insn 10.

I am not familiar with RA in GCC, so I'm not entirely sure what code to blame
for this sub-optimal allocation. Any comments or pointers would be most
welcome.