27016 – [4.3/4.4/4.5 Regression] ARM optimizer produces severely suboptimal code

Bug 27016 - [4.3/4.4/4.5 Regression] ARM optimizer produces severely suboptimal code

Summary: [4.3/4.4/4.5 Regression] ARM optimizer produces severely suboptimal code

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	4.1.0

Importance:	P2 normal
Target Milestone:	4.3.5
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:	16996 36712
	Show dependency tree / graph

Reported:	2006-04-04 08:12 UTC by Eric Dönges
Modified:	2010-02-26 01:23 UTC (History)
CC List:	3 users (show)

See Also:
Host:
Target:	arm-elf, arm-eabi
Build:
Known to work:	3.4.6
Known to fail:	4.4.0 4.5.0
Last reconfirmed:	2010-01-02 01:05:52

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Eric Dönges 2006-04-04 08:12:43 UTC

The compiler creates extremely bad code for the ARM target.
Consider the following source file:

--- SNIP ---
unsigned int code_in_ram[100];

void testme(void)
{
  unsigned int *p_rom, *p_ram, *p_end, len;

  extern unsigned int _ram_erase_sector_start;
  extern unsigned int _ram_erase_sector_end;


  p_ram = code_in_ram;
  p_rom = &_ram_erase_sector_start;
  len = ((unsigned int)&_ram_erase_sector_end 
         - (unsigned int)&_ram_erase_sector_start) / sizeof(unsigned int);

  for (p_rom = &_ram_erase_sector_start, p_end = &_ram_erase_sector_end;
       p_rom < p_end;) {
    *p_ram++ = *p_rom++;
  }
}
--- SNIP ---

Compiled with arm-elf-gcc -mcpu=arm7tdmi -S -Os testme.c, we get the following code:

--- SNIP ---
        .file   "testme.c"
        .text
        .align  2
        .global testme
        .type   testme, %function
testme:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldr     r1, .L6
        ldr     r2, .L6+4
        @ lr needed for prologue
        b       .L2
.L3:
        ldr     r3, [r1], #4
        str     r3, [r2, #-4]
.L2:
        ldr     r3, .L6+8
        cmp     r1, r3
        add     r2, r2, #4
        bcc     .L3
        bx      lr
.L7:
        .align  2
.L6:
        .word   _ram_erase_sector_start
        .word   code_in_ram
        .word   _ram_erase_sector_end
        .size   testme, .-testme
        .comm   code_in_ram,400,4
        .ident  "GCC: (GNU) 4.1.0"
--- SNIP ---

Even a cursory examination reveals that it would be a lot better to write:

    ldr r1, .L6
    ldr r2, .L6+4
    ldr r0, .L6+8
    b   .L2
    
.L3:
    ldr r3, [r1], #4
    str r3, [r2], #4
.L2:
    cmp r1, r0
    bcc .L3
    bx  lr

This code would be one instruction shorter overall , and two instructions less in the loop. The way
gcc-4.1.0 refuses to use post-indexed addressing for the store is especially bizzare, since it does use
post-indexed addressing for the preceeding load. Gcc 3.4.3 does not exhibit this behaviour; it compiles
the above code to:

   ldr r2, .L6
   ldr r0, .L6+4
   cmp r2,r0
   ldr r1, .L6
   movcs pc,lr

.L4:
   ldr r2,[r2],#4
   cmp r2, r0
   str r3,[r1],#4
   bcc .L4
   mov pc,lr

While not perfect either, this also only has 4 instructions in the loop.

Comment 1 Andrew Pinski 2006-04-04 08:16:45 UTC

This code is undefined:
  len = ((unsigned int)&_ram_erase_sector_end 
         - (unsigned int)&_ram_erase_sector_start) / sizeof(unsigned int);

That is obviously undefined as taking the difference between two pointers which are not in the same array is undefined code.

Even the comparision:
p_rom < p_end;
is undefined.

Comment 2 Eric Dönges 2006-04-04 09:13:59 UTC

(In reply to comment #1)
> This code is undefined:
>   len = ((unsigned int)&_ram_erase_sector_end 
>          - (unsigned int)&_ram_erase_sector_start) / sizeof(unsigned int);
> 
> That is obviously undefined as taking the difference between two pointers which
> are not in the same array is undefined code.
> 
> Even the comparision:
> p_rom < p_end;
> is undefined.
> 
In the code I took this snippet from, _ram_erase_sector_start and _ram_erase_sector_end are symbols
generated by the linker at the start and the end of a special segment which I need to copy to ram,
so I would argue that these pointers do in fact refer to the same "array" (in this case, the "array" is the
entire flash memory). 

However, none of this should affect the decision to use (or not to use) the post-indexed addressing
mode. If I replace the for loop with a for (len = 100; len > 0; --len), the quality of the generated code
actually degrades even further:

        ldr     r2, .L7
        ldr     r1, .L7+4
        @ lr needed for prologue
.L2:
        ldr     r3, [r1, #-4]
        str     r3, [r2, #-4]
        ldr     r3, .L7+8
        add     r2, r2, #4
        cmp     r2, r3
        add     r1, r1, #4
        bne     .L2
        bx      lr

While I thinks it's nifty that gcc recognizes that it doesn't need to keep the len variable, but instead
uses p_ram to determine when the loop is finished, I also think it's pretty brain-dead that it won't
use post-indexed addressing for either the ldr or str in the loop. And why it thinks it needs to load
the constant end address to compare against every time inside the loop instead of once into a scratch
register outside the loop is anyone's guess.

Comment 3 Steven Bosscher 2009-06-30 11:35:25 UTC

For this test case:

unsigned int code_in_ram[100];

void testme(void)
{
  unsigned int *p_rom, *p_ram, *p_end, len;

  extern unsigned int _ram_erase_sector_start;
  extern unsigned int _ram_erase_sector_end;

  p_ram = code_in_ram;
  p_rom = &_ram_erase_sector_start;
  len = ((unsigned int)&_ram_erase_sector_end
         - (unsigned int)&_ram_erase_sector_start) / sizeof(unsigned int);

  for (len = 100; len > 0; --len)
    {
      *p_ram++ = *p_rom++;
    }
}


I get the following code with a 4.5.0 checkout from a few days ago (see .ident):

        .cpu arm7tdmi
        .fpu softvfp
        .eabi_attribute 20, 1
        .eabi_attribute 21, 1
        .eabi_attribute 23, 3
        .eabi_attribute 24, 1
        .eabi_attribute 25, 1
        .eabi_attribute 26, 1
        .eabi_attribute 30, 4
        .eabi_attribute 18, 4
        .file   "t.c"
        .text
        .align  2
        .global testme
        .type   testme, %function
testme:
        @ Function supports interworking.
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldr     r1, .L5
        ldr     r2, .L5+4
        mov     r3, #0
.L2:
        ldr     r0, [r2, r3]
        str     r0, [r1, r3]
        add     r3, r3, #4
        cmp     r3, #400
        bne     .L2
        bx      lr
.L6:
        .align  2
.L5:
        .word   code_in_ram
        .word   _ram_erase_sector_start
        .size   testme, .-testme
        .comm   code_in_ram,400,4
        .ident  "GCC: (GNU) 4.5.0 20090626 (experimental) [trunk revision 148960]"


This looks marginally better than the code of comment #2.



With the original test case of comment #0 I get the following code:

        .cpu arm7tdmi
        .fpu softvfp
        .eabi_attribute 20, 1
        .eabi_attribute 21, 1
        .eabi_attribute 23, 3
        .eabi_attribute 24, 1
        .eabi_attribute 25, 1
        .eabi_attribute 26, 1
        .eabi_attribute 30, 4
        .eabi_attribute 18, 4
        .file   "t.c"
        .text
        .align  2
        .global testme
        .type   testme, %function
testme:
        @ Function supports interworking.
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldr     r2, .L5
        ldr     r3, .L5+4
        ldr     r1, .L5+8
        b       .L2
.L3:
        ldr     r0, [r3], #4
        str     r0, [r2, #-4]
.L2:
        cmp     r3, r1
        add     r2, r2, #4
        bcc     .L3
        bx      lr
.L6:
        .align  2
.L5:
        .word   code_in_ram
        .word   _ram_erase_sector_start
        .word   _ram_erase_sector_end
        .size   testme, .-testme
        .comm   code_in_ram,400,4
        .ident  "GCC: (GNU) 4.5.0 20090626 (experimental) [trunk revision 148960]"


The "ldr     r3, .L6+8" line of comment #0 is now hoisted from the loop into r1 in "ldr     r1, .L5+8".  GCC still does not use a post-increment for the store in the loop (see use of r2), even though it uses a post-increment for the load.

So, bug still is there.  Might be a dup of one of those many other bugs about poor use of auto-increment instructions by GCC... ;-)

Comment 4 Steven Bosscher 2009-06-30 13:21:17 UTC

The auto-inc-dec pass fails because the store and the reg increment are not in the same basic block.  The dump of the pass before auto-inc-dec (reginfo) looks like this:

;; Function testme (testme)

   74 NOTE_INSN_BASIC_BLOCK
   72 NOTE_INSN_FUNCTION_BEG
   76 r205:SI=`code_in_ram'
   73 r203:SI=`_ram_erase_sector_start'
   87 r208:SI=`_ram_erase_sector_end'
L86:
   79 NOTE_INSN_BASIC_BLOCK
   80 r206:SI=[r203:SI]
   81 [r205:SI-0x4]=r206:SI
      REG_DEAD: r206:SI
   82 r203:SI=r203:SI+0x4
L83:
   84 NOTE_INSN_BASIC_BLOCK
   85 r205:SI=r205:SI+0x4
   88 cc:CC=cmp(r203:SI,r208:SI)
      REG_EQUAL: cmp(r203:SI,`_ram_erase_sector_end')
   89 pc={(ltu(cc:CC,0x0))?L86:pc}
      REG_DEAD: cc:CC
      REG_BR_PROB: 0x238c
   92 NOTE_INSN_BASIC_BLOCK


Then auto-inc-dec comes along and notices the opportunity to merge the load and increment in insns 80 and 82:

starting bb 3
   82 r203:SI=r203:SI+0x4
   81 [r205:SI-0x4]=r206:SI
      REG_DEAD: r206:SI
   81 [r205:SI-0x4]=r206:SI
      REG_DEAD: r206:SI
found mem(81) *(r[205]+-4)
   80 r206:SI=[r203:SI]
   80 r206:SI=[r203:SI]
found mem(80) *(r[203]+0)
   82 r203:SI=r203:SI+0x4
found post inc(82) r[203]+=4
trying SIMPLE_POST_INC
rescanning insn with uid = 80.
deleting insn with uid = 80.
deleting insn with uid = 82.
****success    80 r206:SI=[r203:SI++]
      REG_INC: r203:SI


Merging the store and increment of insns 81 and 85 is never tried because the insns are not in the same basic block.

Bernd Schmidt has patches that probably address this case of insns in different basic blocks.  I will give his patches a try to see if it helps for ARM (the patches were written for Blackfin).

Comment 5 Steven Bosscher 2009-06-30 13:27:09 UTC

Compiling with "./cc1 -Os t.c -fno-ivopts" I get the following code:

        .global testme
        .type   testme, %function
testme:
        @ Function supports interworking.
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldr     r2, .L5
        ldr     r3, .L5+4
        ldr     r1, .L5+8
        b       .L2
.L3:
        ldr     r0, [r3], #4
        str     r0, [r2], #4
.L2:
        cmp     r3, r1
        bcc     .L3
        bx      lr

Comment 6 Steven Bosscher 2010-01-02 01:05:52 UTC

Comment #3 and comment #5 still true with "GCC: (GNU) 4.5.0 20090808 (experimental) [trunk revision 150579]" -> reconfirmed.

Comment 7 Steven Bosscher 2010-02-12 22:52:18 UTC

If arm_arm_address_cost is "fixed" to return 1 unconditionally, the expected code of comment #5 comes out at -Os, with the bonus of one less branch:

testme:
	ldr	r2, .L4
	ldr	r1, .L4+4
	sub	r3, r2, #400
.L2:
	ldr	r0, [r1], #4
	str	r0, [r3], #4
	cmp	r3, r2
	bne	.L2
	bx	lr

This is actually a regression from GCC 3.4.6.

Comment 8 Steven Bosscher 2010-02-21 14:32:06 UTC

I have played with CSiBE with this patch:

-------------------------------------------------- 8< -------------
--- ../../gcc/gcc/config/arm/arm.c	2010-02-12 21:45:29.000000000 +0100
+++ ../../combined/gcc/config/arm/arm.c	2010-02-21 14:54:23.000000000 +0100
@@ -7448,7 +7448,7 @@
 arm_arm_address_cost (rtx x)
 {
   enum rtx_code c  = GET_CODE (x);
-
+if (1) return 1;
   if (c == PRE_INC || c == PRE_DEC || c == POST_INC || c == POST_DEC)
     return 0;
   if (c == MEM || c == LABEL_REF || c == SYMBOL_REF)
-------------------------------------------------- 8< -------------

I tried CSiBE on arm-eabi with "-Os -mabi=aapcs-linux" (latter flag is required for mpeg2dec-0.3.1), but I only looked at size because I have no ARM board to test speed on.

Overall the patch results in smaller code:

   text	   data	    bss	    dec	    hex	filename
3226845	 316581	 561788	4105214	 3ea3fe	(TOTALS)	    unpatched
3226269	 316581	 561788	4104638	 3ea1be	(TOTALS)	    patched

For the individual files the results vary. A more detailed investigation is necessary to figure out when a patch like the above is a win, and when it is not.

Comment 9 Jeffrey A. Law 2010-02-26 01:23:20 UTC

The trunk now generates:

	ldr	r3, .L4
	ldr	r2, .L4+4
	ldr	r1, .L4+8
	b	.L2
.L3:
	ldr	r0, [r3], #4
	str	r0, [r2], #4
.L2:
	cmp	r3, r1
	bcc	.L3
	bx	lr
.L5:

This is probably a direct result of Bernd's work to improve auto-inc addressing modes, particularly the ability to discover them when the memref and increment are in different basic blocks.