When consecutive loads read consecutive memory locations but the base address is not held in a register (e.g. the loads use a label that will be converted to PC-relative loads), the corresponding peephole patterns do not optimize them. The pattern matches, but no load-multiple instruction is generated. The same applies to stores.

In the attached modified assembly code, the 4 load instructions are replaced by an address computation and a multiple load (note that no additional register is required).

Release:
gcc version 3.3 20030217 (prerelease)

Environment:
BUILD & HOST: Linux 2.4.20 i686 unknown
TARGET: arm-unknown-elf

How-To-Repeat:
gcc -S -Os 01.i

// 01.i
# 1 "01.c"
# 1 "<built-in>"
# 1 "<command line>"
# 1 "01.c"
int f(int, int, int, int);

void foo ()
{
  f(12345,238764,2345234, 83746556);
}
Hello, I can confirm that this problem is still present on gcc 3.3 branch and mainline (20030512). Dara
See Dara's comment.
(In reply to comment #2)
> See Dara's comment.

Occurs even today.

foo:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	ldr	r0, .L3
	ldr	r1, .L3+4
	ldr	r2, .L3+8
	ldr	r3, .L3+12
	b	f
.L4:
	.align	2
.L3:
	.word	12345
	.word	238764
	.word	2345234
	.word	83746556
	.size	foo, .-foo
	.ident	"GCC: (GNU) 4.4.0 20090312 (experimental)"
	.section	.note.GNU-stack,"",%progbits
Created attachment 17638
Testcase for gcc 4.4.0
See the attached pqp.c file.

With gcc 4.3.3, on such simplistic examples, peephole ldm and stm works:

sum:
	ldr	r2, .L3
	ldmia	r2, {r1, r3}	@ phole ldm
	add	r3, r0, r3
	add	r0, r0, r1
	stmia	r2, {r0, r3}	@ phole stm
	bx	lr

With gcc 4.4.0 branch, built on 20090413, it fails:

sum:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	ldr	r3, .L3
	ldr	r2, [r3, #0]
	ldr	r1, [r3, #4]
	add	r2, r0, r2
	add	r1, r0, r1
	str	r1, [r3, #4]
	str	r2, [r3, #0]
	bx	lr
(In reply to comment #5)
> See the attached pqp.c file.
>
> [cut]
>
> With gcc 4.4.0 branch, built on 20090413, it fails:
>

This seems to be caused by the register order allocation. If I replace the source code lines to operate in the reverse order:

	hehe.y += pqp;
	hehe.x += pqp;

Then 4.4.0 20090413 generates optimized code:

	ldr	r3, .L3
	ldmia	r3, {r1, r2}	@ phole ldm
	add	r2, r0, r2
	add	r1, r0, r1
	stmia	r3, {r1, r2}	@ phole stm
	bx	lr

While gcc 4.3.3 does not :-) Funny thing, isn't it?
(In reply to comment #5)
> See the attached pqp.c file.
>
> With gcc 4.3.3, on such simplistic examples, peephole ldm and stm works:
>
> sum:
>	ldr	r2, .L3
>	ldmia	r2, {r1, r3}	@ phole ldm
>	add	r3, r0, r3
>	add	r0, r0, r1
>	stmia	r2, {r0, r3}	@ phole stm
>	bx	lr
>
>
> With gcc 4.4.0 branch, built on 20090413, it fails:
>
> sum:
>	@ args = 0, pretend = 0, frame = 0
>	@ frame_needed = 0, uses_anonymous_args = 0
>	@ link register save eliminated.
>	ldr	r3, .L3
>	ldr	r2, [r3, #0]
>	ldr	r1, [r3, #4]
>	add	r2, r0, r2
>	add	r1, r0, r1
>	str	r1, [r3, #4]
>	str	r2, [r3, #0]
>	bx	lr
>

We can't use stm or ldm in the second case because ldm and stm require the lowest-numbered register to go to the lowest memory address. Here r2 is paired with offset #0 and r1 with offset #4, which is the wrong way round. This is an artifact of the register allocator choosing a different order for the registers in such cases. ldm and stm are not easily describable in the RTL backend and are semi-bolted on top of the existing infrastructure using peepholes.
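The register-ordering constraint described in this comment can be sketched as a small standalone check. This is only an illustration and not GCC's peephole code: the names `word_access` and `can_merge_into_ldm_stm` are hypothetical, and the sketch assumes word-sized accesses off a single base register. It ignores other conditions a real backend would also check, such as the starting offset required by each addressing variant or the base register appearing in the register list.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one word-sized ldr/str off a common base register. */
struct word_access {
    int reg;     /* ARM core register number being loaded or stored */
    int offset;  /* byte offset from the base register */
};

/*
 * Returns true if the accesses could be expressed as one ldm/stm as far as
 * register ordering is concerned: when sorted by address, the offsets must
 * be consecutive words and the register numbers strictly increasing,
 * because ldm/stm always pair the lowest-numbered register with the
 * lowest address.
 */
bool can_merge_into_ldm_stm(const struct word_access *acc, size_t n)
{
    struct word_access sorted[16];

    if (n < 2 || n > 16)
        return false;

    /* Insertion sort by offset; instruction order need not be address order. */
    for (size_t i = 0; i < n; i++) {
        struct word_access key = acc[i];
        size_t j = i;
        while (j > 0 && sorted[j - 1].offset > key.offset) {
            sorted[j] = sorted[j - 1];
            j--;
        }
        sorted[j] = key;
    }

    for (size_t i = 1; i < n; i++) {
        if (sorted[i].offset != sorted[i - 1].offset + 4)
            return false;  /* not consecutive words */
        if (sorted[i].reg <= sorted[i - 1].reg)
            return false;  /* register order does not follow address order */
    }
    return true;
}
```

Applied to the two listings quoted above, the 4.3.3 loads (r1 at offset 0, r3 at offset 4) pass this check, while the 4.4.0 loads (r2 at offset 0, r1 at offset 4) fail it.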
There doesn't appear to be anything that can be improved here. Literal pool loads can't be easily peepholed into LDM, and there aren't many opportunities anyway.