ARM load/store multiple

Wed Nov 17 04:00:00 GMT 1999

The problem is constraining the register allocator to do the right thing.  
To use ldm/stm, the registers must be in ascending order (though not 
necessarily contiguous) for ascending memory addresses.  Since I can't 
see any way to make the register allocator do this, we can only use 
post-reload peepholes to try to patch up suitable sequences of loads and 
stores.  That tends to be very inefficient, particularly if the scheduler 
has moved things around and broken up sequences of loads and stores.
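
As a rough illustration of the ordering rule above, here is a minimal 
sketch in C of the check a peephole would have to pass.  It is not the 
code in the ARM port; struct mem_op and can_use_multiple are hypothetical 
names, and the sketch assumes word-sized accesses off a common base 
register that have already been sorted by offset.  It also relies on the 
fact that ldm/stm transfer consecutive words, with the lowest-numbered 
register going to the lowest address.

```c
#include <stdbool.h>
#include <stddef.h>

/* One word-sized load or store (hypothetical type, for illustration):
   a hard register number and a byte offset from a common base.  */
struct mem_op {
    int regno;   /* 0..15; ip is r12, lr is r14 */
    int offset;  /* byte offset from the base register */
};

/* Return true if ops[0..n-1], already sorted by ascending offset, could
   be expressed as one ldm/stm: the words must be consecutive in memory
   and the register numbers strictly ascending (gaps are allowed).  */
bool can_use_multiple(const struct mem_op *ops, size_t n)
{
    if (n < 2)
        return false;
    for (size_t i = 1; i < n; i++) {
        if (ops[i].offset != ops[i - 1].offset + 4)
            return false;   /* not adjacent words */
        if (ops[i].regno <= ops[i - 1].regno)
            return false;   /* register order does not match memory order */
    }
    return true;
}
```

Applied to the quoted output below, the three loads into {r1, r5, ip} at 
offsets 12, 16 and 20 pass the check.  The six stores, sorted by offset, 
use r1, r4, ip, r5, lr, r2, so the check fails as soon as r5 follows ip.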

The only other way to get ldm/stm sequences is to catch them during rtl 
generation and use explicit hard registers (this is used to handle block 
copies), but then the explicit hard registers can hurt register 
allocation, so it is a trade-off between the number of loads that can be 
merged and the impact on the number of registers used -- the current code 
uses 4 loads at a time, which has about the same impact on register 
allocation as a function call (e.g. a call to memcpy).
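
To make the trade-off concrete, here is a hedged sketch in C of a block 
copy that moves four words per step.  It only illustrates the shape of 
the idea; it is not what the compiler generates, and copy_words_by_four 
is a hypothetical name.  The four temporaries stand in for the four 
explicit hard registers that one ldmia/stmia pair would tie up.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy nwords 32-bit words from src to dst, four at a time.  Each group
   of four corresponds to one load-multiple followed by one
   store-multiple; any leftover words are copied singly.  */
void copy_words_by_four(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    while (nwords >= 4) {
        /* Four live temporaries: the register cost of merging 4 loads.  */
        uint32_t a = src[0], b = src[1], c = src[2], d = src[3];
        dst[0] = a;
        dst[1] = b;
        dst[2] = c;
        dst[3] = d;
        src += 4;
        dst += 4;
        nwords -= 4;
    }
    while (nwords-- > 0)
        *dst++ = *src++;
}
```

Merging more than four words per step would need more temporaries live at 
once, which is the register-pressure cost described above.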

If someone can think of a better way of representing this, so that some 
pre-regalloc pass can combine loads in a way that will force the register 
allocator to do the right thing, then I'm open to suggestions.

Another useful extension might be to have block spills/restores when we 
run out of registers, but that would need a lot of work in reload.

Richard.

> I was thinking that gcc should use the load/store multiple
> instructions for:
> 
> int *ip;
> 
> main() {
>   int i1, i2, i3, i4, i5, i6;
>   i1 = ip[2];
>   i2 = ip[3];
>   i3 = ip[4];
>   i4 = ip[5];
>   i5 = ip[6];
>   i6 = ip[7];
>   asm volatile ("# bla bla");
>   ip[2] = i2;
>   ip[3] = i1;
>   ip[4] = i4;
>   ip[5] = i3;
>   ip[6] = i6;
>   ip[7] = i5;
> }
> 
> type of code, and yet it seems to have difficulty using them.  Since
> I am not an arm head, I'm not certain that this would be a win.  Since
> I haven't tracked it down, I don't know why it isn't using stm/ldm
> instructions more often.  I checked the arm port and it seemed as
> though the code tries to group 2, 3 and 4 loads that are adjacent into
> one instruction.  And yet, I see:
> 
>         ldr     r2, L3
>         ldr     r3, [r2, #0]
>         ldr     r4, [r3, #8]
>         add     r1, r3, #12
>         ldmia   r1, {r1, r5, ip}        @ phole ldm
>         add     r2, r3, #24
>         ldmia   r2, {r2, lr}    @ phole ldm
>         # bla bla
>         str     r2, [r3, #28]
>         str     r1, [r3, #8]
>         str     r4, [r3, #12]
>         str     ip, [r3, #16]
>         str     r5, [r3, #20]
>         str     lr, [r3, #24]
> 
> As I explore around with different sample code, I see that the
> compiler does try a little bit, but just doesn't succeed often.
> 
> I can't help but wonder if these types of instructions are supported
> in the right fashion in the md file.
> 
> Anyway, thought I would forward this along.