The compiler creates extremely bad code for the ARM target. Consider the following source file: --- SNIP --- unsigned int code_in_ram[100]; void testme(void) { unsigned int *p_rom, *p_ram, *p_end, len; extern unsigned int _ram_erase_sector_start; extern unsigned int _ram_erase_sector_end; p_ram = code_in_ram; p_rom = &_ram_erase_sector_start; len = ((unsigned int)&_ram_erase_sector_end - (unsigned int)&_ram_erase_sector_start) / sizeof(unsigned int); for (p_rom = &_ram_erase_sector_start, p_end = &_ram_erase_sector_end; p_rom < p_end;) { *p_ram++ = *p_rom++; } } --- SNIP --- Compiled with arm-elf-gcc -mcpu=arm7tdmi -S -Os testme.c, we get the following code: --- SNIP --- .file "testme.c" .text .align 2 .global testme .type testme, %function testme: @ args = 0, pretend = 0, frame = 0 @ frame_needed = 0, uses_anonymous_args = 0 @ link register save eliminated. ldr r1, .L6 ldr r2, .L6+4 @ lr needed for prologue b .L2 .L3: ldr r3, [r1], #4 str r3, [r2, #-4] .L2: ldr r3, .L6+8 cmp r1, r3 add r2, r2, #4 bcc .L3 bx lr .L7: .align 2 .L6: .word _ram_erase_sector_start .word code_in_ram .word _ram_erase_sector_end .size testme, .-testme .comm code_in_ram,400,4 .ident "GCC: (GNU) 4.1.0" --- SNIP --- Even a cursory examination reveals that it would be a lot better to write: ldr r1, .L6 ldr r2, .L6+4 ldr r0, .L6+8 b .L2 .L3: ldr r3, [r1], #4 str r3, [r2], #4 .L2: cmp r1, r0 bcc .L3 bx lr This code would be one instruction shorter overall , and two instructions less in the loop. The way gcc-4.1.0 refuses to use post-indexed addressing for the store is especially bizzare, since it does use post-indexed addressing for the preceeding load. Gcc 3.4.3 does not exhibit this behaviour; it compiles the above code to: ldr r2, .L6 ldr r0, .L6+4 cmp r2,r0 ldr r1, .L6 movcs pc,lr .L4: ldr r2,[r2],#4 cmp r2, r0 str r3,[r1],#4 bcc .L4 mov pc,lr While not perfect either, this also only has 4 instructions in the loop.
This code is undefined: len = ((unsigned int)&_ram_erase_sector_end - (unsigned int)&_ram_erase_sector_start) / sizeof(unsigned int); That is obviously undefined as taking the difference between two pointers which are not in the same array is undefined code. Even the comparision: p_rom < p_end; is undefined.
(In reply to comment #1) > This code is undefined: > len = ((unsigned int)&_ram_erase_sector_end > - (unsigned int)&_ram_erase_sector_start) / sizeof(unsigned int); > > That is obviously undefined as taking the difference between two pointers which > are not in the same array is undefined code. > > Even the comparision: > p_rom < p_end; > is undefined. > In the code I took this snippet from, _ram_erase_sector_start and _ram_erase_sector_end are symbols generated by the linker at the start and the end of a special segment which I need to copy to ram, so I would argue that these pointers do in fact refer to the same "array" (in this case, the "array" is the entire flash memory). However, none of this should affect the decision to use (or not to use) the post-indexed addressing mode. If I replace the for loop with a for (len = 100; len > 0; --len), the quality of the generated code actually degrades even further: ldr r2, .L7 ldr r1, .L7+4 @ lr needed for prologue .L2: ldr r3, [r1, #-4] str r3, [r2, #-4] ldr r3, .L7+8 add r2, r2, #4 cmp r2, r3 add r1, r1, #4 bne .L2 bx lr While I thinks it's nifty that gcc recognizes that it doesn't need to keep the len variable, but instead uses p_ram to determine when the loop is finished, I also think it's pretty brain-dead that it won't use post-indexed addressing for either the ldr or str in the loop. And why it thinks it needs to load the constant end address to compare against every time inside the loop instead of once into a scratch register outside the loop is anyone's guess.
For this test case: unsigned int code_in_ram[100]; void testme(void) { unsigned int *p_rom, *p_ram, *p_end, len; extern unsigned int _ram_erase_sector_start; extern unsigned int _ram_erase_sector_end; p_ram = code_in_ram; p_rom = &_ram_erase_sector_start; len = ((unsigned int)&_ram_erase_sector_end - (unsigned int)&_ram_erase_sector_start) / sizeof(unsigned int); for (len = 100; len > 0; --len) { *p_ram++ = *p_rom++; } } I get the following code with a 4.5.0 checkout from a few days ago (see .ident): .cpu arm7tdmi .fpu softvfp .eabi_attribute 20, 1 .eabi_attribute 21, 1 .eabi_attribute 23, 3 .eabi_attribute 24, 1 .eabi_attribute 25, 1 .eabi_attribute 26, 1 .eabi_attribute 30, 4 .eabi_attribute 18, 4 .file "t.c" .text .align 2 .global testme .type testme, %function testme: @ Function supports interworking. @ args = 0, pretend = 0, frame = 0 @ frame_needed = 0, uses_anonymous_args = 0 @ link register save eliminated. ldr r1, .L5 ldr r2, .L5+4 mov r3, #0 .L2: ldr r0, [r2, r3] str r0, [r1, r3] add r3, r3, #4 cmp r3, #400 bne .L2 bx lr .L6: .align 2 .L5: .word code_in_ram .word _ram_erase_sector_start .size testme, .-testme .comm code_in_ram,400,4 .ident "GCC: (GNU) 4.5.0 20090626 (experimental) [trunk revision 148960]" This looks marginally better than the code of comment #2. With the original test case of comment #0 I get the following code: .cpu arm7tdmi .fpu softvfp .eabi_attribute 20, 1 .eabi_attribute 21, 1 .eabi_attribute 23, 3 .eabi_attribute 24, 1 .eabi_attribute 25, 1 .eabi_attribute 26, 1 .eabi_attribute 30, 4 .eabi_attribute 18, 4 .file "t.c" .text .align 2 .global testme .type testme, %function testme: @ Function supports interworking. @ args = 0, pretend = 0, frame = 0 @ frame_needed = 0, uses_anonymous_args = 0 @ link register save eliminated. ldr r2, .L5 ldr r3, .L5+4 ldr r1, .L5+8 b .L2 .L3: ldr r0, [r3], #4 str r0, [r2, #-4] .L2: cmp r3, r1 add r2, r2, #4 bcc .L3 bx lr .L6: .align 2 .L5: .word code_in_ram .word _ram_erase_sector_start .word _ram_erase_sector_end .size testme, .-testme .comm code_in_ram,400,4 .ident "GCC: (GNU) 4.5.0 20090626 (experimental) [trunk revision 148960]" The "ldr r3, .L6+8" line of comment #0 is now hoisted from the loop into r1 in "ldr r1, .L5+8". GCC still does not use a post-increment for the store in the loop (see use of r2), even though it uses a post-increment for the load. So, bug still is there. Might be a dup of one of those many other bugs about poor use of auto-increment instructions by GCC... ;-)
The auto-inc-dec pass fails because the store and the reg increment are not in the same basic block. The dump of the pass before auto-inc-dec (reginfo) looks like this: ;; Function testme (testme) 74 NOTE_INSN_BASIC_BLOCK 72 NOTE_INSN_FUNCTION_BEG 76 r205:SI=`code_in_ram' 73 r203:SI=`_ram_erase_sector_start' 87 r208:SI=`_ram_erase_sector_end' L86: 79 NOTE_INSN_BASIC_BLOCK 80 r206:SI=[r203:SI] 81 [r205:SI-0x4]=r206:SI REG_DEAD: r206:SI 82 r203:SI=r203:SI+0x4 L83: 84 NOTE_INSN_BASIC_BLOCK 85 r205:SI=r205:SI+0x4 88 cc:CC=cmp(r203:SI,r208:SI) REG_EQUAL: cmp(r203:SI,`_ram_erase_sector_end') 89 pc={(ltu(cc:CC,0x0))?L86:pc} REG_DEAD: cc:CC REG_BR_PROB: 0x238c 92 NOTE_INSN_BASIC_BLOCK Then auto-inc-dec comes along and notices the opportunity to merge the load and increment in insns 80 and 82: starting bb 3 82 r203:SI=r203:SI+0x4 81 [r205:SI-0x4]=r206:SI REG_DEAD: r206:SI 81 [r205:SI-0x4]=r206:SI REG_DEAD: r206:SI found mem(81) *(r[205]+-4) 80 r206:SI=[r203:SI] 80 r206:SI=[r203:SI] found mem(80) *(r[203]+0) 82 r203:SI=r203:SI+0x4 found post inc(82) r[203]+=4 trying SIMPLE_POST_INC rescanning insn with uid = 80. deleting insn with uid = 80. deleting insn with uid = 82. ****success 80 r206:SI=[r203:SI++] REG_INC: r203:SI Merging the store and increment of insns 81 and 85 is never tried because the insns are not in the same basic block. Bernd Schmidt has patches that probably address this case of insns in different basic blocks. I will give his patches a try to see if it helps for ARM (the patches were written for Blackfin).
Compiling with "./cc1 -Os t.c -fno-ivopts" I get the following code: .global testme .type testme, %function testme: @ Function supports interworking. @ args = 0, pretend = 0, frame = 0 @ frame_needed = 0, uses_anonymous_args = 0 @ link register save eliminated. ldr r2, .L5 ldr r3, .L5+4 ldr r1, .L5+8 b .L2 .L3: ldr r0, [r3], #4 str r0, [r2], #4 .L2: cmp r3, r1 bcc .L3 bx lr
Comment #3 and comment #5 still true with "GCC: (GNU) 4.5.0 20090808 (experimental) [trunk revision 150579]" -> reconfirmed.
If arm_arm_address_cost is "fixed" to return 1 unconditionally, the expected code of comment #5 comes out at -Os, with the bonus of one less branch: testme: ldr r2, .L4 ldr r1, .L4+4 sub r3, r2, #400 .L2: ldr r0, [r1], #4 str r0, [r3], #4 cmp r3, r2 bne .L2 bx lr This is actually a regression from GCC 3.4.6.
I have played with CSiBE with this patch: -------------------------------------------------- 8< ------------- --- ../../gcc/gcc/config/arm/arm.c 2010-02-12 21:45:29.000000000 +0100 +++ ../../combined/gcc/config/arm/arm.c 2010-02-21 14:54:23.000000000 +0100 @@ -7448,7 +7448,7 @@ arm_arm_address_cost (rtx x) { enum rtx_code c = GET_CODE (x); - +if (1) return 1; if (c == PRE_INC || c == PRE_DEC || c == POST_INC || c == POST_DEC) return 0; if (c == MEM || c == LABEL_REF || c == SYMBOL_REF) -------------------------------------------------- 8< ------------- I tried CSiBE on arm-eabi with "-Os -mabi=aapcs-linux" (latter flag is required for mpeg2dec-0.3.1), but I only looked at size because I have no ARM board to test speed on. Overall the patch results in smaller code: text data bss dec hex filename 3226845 316581 561788 4105214 3ea3fe (TOTALS) unpatched 3226269 316581 561788 4104638 3ea1be (TOTALS) patched For the individual files the results vary. A more detailed investigation is necessary to figure out when a patch like the above is a win, and when it is not.
The trunk now generates: ldr r3, .L4 ldr r2, .L4+4 ldr r1, .L4+8 b .L2 .L3: ldr r0, [r3], #4 str r0, [r2], #4 .L2: cmp r3, r1 bcc .L3 bx lr .L5: This is probably a direct result of Bernd's work to improve auto-inc addressing modes, particularly the ability to discover them when the memref and increment are in different basic blocks.