Bug 9831

Summary: [ARM] Peephole for multiple load/store could be more effective.
Product: gcc Reporter: gertom
Component: target Assignee: Not yet assigned to anyone <unassigned>
Status: RESOLVED WONTFIX    
Severity: enhancement CC: gcc-bugs, ramana.r, rearnsha
Priority: P3 Keywords: missed-optimization
Version: 3.3   
Target Milestone: ---   
Host: Target: arm-*-elf
Build: Known to work:
Known to fail: Last reconfirmed: 2005-12-09 04:27:54
Attachments: multiple-load-store.tar.gz
Testcase for gcc 4.4.0

Description gertom 2003-02-24 15:26:00 UTC
When consecutive loads read consecutive memory locations but the base address is not held in a register (e.g. the loads use a label, which is converted to PC-relative loads), the corresponding peephole patterns do not optimize them. The pattern matches, but no multiple-load instruction is generated. The same applies to stores.

In the attached modified assembly code the 4 load instructions are replaced by an address computation and a multiple load (note that no additional register is required).

Release:
gcc version 3.3 20030217 (prerelease)

Environment:
BUILD & HOST: Linux 2.4.20 i686 unknown
TARGET: arm-unknown-elf

How-To-Repeat:
gcc -S -Os 01.i

// 01.i

# 1 "01.c"
# 1 "<built-in>"
# 1 "<command line>"
# 1 "01.c"
int f(int, int, int, int);

void foo ()
{
  f(12345,238764,2345234, 83746556);
}
Comment 1 Dara Hazeghi 2003-05-26 19:32:40 UTC
Hello,

I can confirm that this problem is still present on gcc 3.3 branch and mainline (20030512).

Dara
Comment 2 Andrew Pinski 2003-05-26 19:34:23 UTC
See Dara's comment.
Comment 3 Ramana Radhakrishnan 2009-03-13 10:54:18 UTC
(In reply to comment #2)
> See Dara's comment.

This still occurs today:

foo:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldr     r0, .L3
        ldr     r1, .L3+4
        ldr     r2, .L3+8
        ldr     r3, .L3+12
        b       f
.L4:
        .align  2
.L3:
        .word   12345
        .word   238764
        .word   2345234
        .word   83746556
        .size   foo, .-foo
        .ident  "GCC: (GNU) 4.4.0 20090312 (experimental)"
        .section        .note.GNU-stack,"",%progbits
Comment 4 Alexandre Pereira Nunes 2009-04-14 20:04:08 UTC
Created attachment 17638
Testcase for gcc 4.4.0
Comment 5 Alexandre Pereira Nunes 2009-04-14 20:07:29 UTC
See the attached pqp.c file.

With gcc 4.3.3, on such simplistic examples, the ldm and stm peepholes work:

sum:
        ldr     r2, .L3
        ldmia   r2, {r1, r3}    @ phole ldm
        add     r3, r0, r3
        add     r0, r0, r1
        stmia   r2, {r0, r3}    @ phole stm
        bx      lr


With gcc 4.4.0 branch, built on 20090413, it fails:

sum:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldr     r3, .L3
        ldr     r2, [r3, #0]
        ldr     r1, [r3, #4]
        add     r2, r0, r2
        add     r1, r0, r1
        str     r1, [r3, #4]
        str     r2, [r3, #0]
        bx      lr
Comment 6 Alexandre Pereira Nunes 2009-04-14 20:11:38 UTC
(In reply to comment #5)
> See the attached pqp.c file.
> 
> [cut]
> 
> With gcc 4.4.0 branch, built on 20090413, it fails:
> 

This seems to be caused by the order in which registers are allocated. If I change the source code lines to operate in the reverse order:

 hehe.y += pqp;
 hehe.x += pqp;

Then 4.4.0 20090413 generates optimized code:

        ldr     r3, .L3
        ldmia   r3, {r1, r2}    @ phole ldm
        add     r2, r0, r2
        add     r1, r0, r1
        stmia   r3, {r1, r2}    @ phole stm
        bx      lr

While gcc 4.3.3 does not :-) Funny thing, isn't it?
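
For readers without the attachment, here is a minimal sketch of what the two source variants might look like. It is a hypothetical reconstruction, not the actual pqp.c: only the names hehe, x, y, pqp and sum appear in this report, while the struct layout, the int types and the name sum_reversed are assumptions chosen to be consistent with the assembly above.

```c
/* Hypothetical reconstruction; the real pqp.c attachment is not shown here.
   The struct layout and the int types are assumptions. */
struct pair {
    int x;   /* assumed to sit at offset 0 */
    int y;   /* assumed to sit at offset 4 */
};

struct pair hehe;

/* Statement order assumed for comment 5. */
void sum(int pqp)
{
    hehe.x += pqp;
    hehe.y += pqp;
}

/* Statement order from comment 6; the function name is made up here. */
void sum_reversed(int pqp)
{
    hehe.y += pqp;
    hehe.x += pqp;
}
```

Both functions compute the same result; only the statement order differs, which is what appears to change the registers the allocator picks.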

Comment 7 Ramana Radhakrishnan 2009-06-16 10:01:26 UTC
(In reply to comment #5)
> See the attached pqp.c file.
> 
> With gcc 4.3.3, on such simplistic examples, peephole ldm and stm works:
> 
> sum:
>         ldr     r2, .L3
>         ldmia   r2, {r1, r3}    @ phole ldm
>         add     r3, r0, r3
>         add     r0, r0, r1
>         stmia   r2, {r0, r3}    @ phole stm
>         bx      lr
> 
> 
> With gcc 4.4.0 branch, built on 20090413, it fails:
> 
> sum:
>         @ args = 0, pretend = 0, frame = 0
>         @ frame_needed = 0, uses_anonymous_args = 0
>         @ link register save eliminated.
>         ldr     r3, .L3
>         ldr     r2, [r3, #0]
>         ldr     r1, [r3, #4]
>         add     r2, r0, r2
>         add     r1, r0, r1
>         str     r1, [r3, #4]
>         str     r2, [r3, #0]
>         bx      lr
> 


We can't use stm or ldm in the second case because ldm and stm require the lowest-numbered register to go to the lowest memory address. This is an artifact of the register allocator choosing a different order for the registers in such cases.

ldm and stm are not easily describable in the RTL backend and are semi-bolted on top of the existing infrastructure using peepholes.
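
The ordering constraint can be illustrated with a small check. This is only a sketch, not GCC's peephole code: the helper name can_form_ldmia and its interface are hypothetical, and it assumes the loads share one base register and are listed in ascending offset order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not GCC code: returns true when n word loads from
   one base register could be merged into a single LDMIA.
   regs[i]    - destination register number of the i-th load
   offsets[i] - byte offset of the i-th load from the base
   The loads are assumed to be listed in ascending offset order. */
bool can_form_ldmia(const int *regs, const int *offsets, size_t n)
{
    assert(regs != NULL && offsets != NULL);
    if (n < 2)
        return false;           /* a single load gains nothing */
    for (size_t i = 1; i < n; i++) {
        if (offsets[i] != offsets[i - 1] + 4)
            return false;       /* words must be consecutive in memory */
        if (regs[i] <= regs[i - 1])
            return false;       /* lower register must get lower address */
    }
    return true;
}
```

With the listings from comment 5, the gcc 4.3.3 allocation (r1 at offset 0, r3 at offset 4) passes this check, while the gcc 4.4.0 allocation (r2 at offset 0, r1 at offset 4) fails it.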


Comment 8 Wilco 2019-05-17 22:16:11 UTC
There doesn't appear to be anything that can be improved here. Literal pool loads can't be easily peepholed into LDM, and there aren't many opportunities anyway.