gcc 3.4 > mainline performance regression

This is from the gcc-help mailing list.  It's mentioned there for ARM,
but it's just as bad for x86-64.  

It appears that memory references to arrays aren't being hoisted out
of loops: in this test case, gcc 3.4 doesn't touch memory at all in
the loop, but 4.3pre (and 4.2, etc.) do.

Here's the test case:

void foo(int *a)
{       int i;
        for (i = 0; i < 1000000; i++)
                a[0] += a[1];
}

gcc 3.4.5 -O2:

        leal    (%rcx,%rsi), %edx
        decl    %eax
        movl    %edx, %ecx
        jns     .L5

gcc 4.3pre -O2:

        addl    4(%rdi), %eax
        addl    $1, %edx
        cmpl    $1000000, %edx
        movl    %eax, (%rdi)
        jne     .L2
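
For comparison, here is a hand-hoisted version of the test case. It is only a sketch of what the 3.4 output effectively does (load before the loop, store once after it), not a proposed fix; the name foo_hoisted is made up for this illustration. Since a[0] and a[1] are distinct objects, keeping both in locals does not change the result.

```c
/* Original test case: with 4.x the loop body reads and writes
 * a[0] through memory on every iteration. */
void foo(int *a)
{
    int i;
    for (i = 0; i < 1000000; i++)
        a[0] += a[1];
}

/* Hand-hoisted variant (hypothetical name, for illustration only):
 * both array elements are loaded once before the loop, and the
 * result is stored once after it, so the loop body needs no
 * memory access at all. */
void foo_hoisted(int *a)
{
    int acc = a[0];   /* running sum, kept in a local */
    int inc = a[1];   /* loop-invariant addend */
    int i;
    for (i = 0; i < 1000000; i++)
        acc += inc;
    a[0] = acc;       /* single store after the loop */
}
```

Comparing the assembly generated for foo and foo_hoisted at the same optimization level should show whether the compiler performs this hoisting on its own.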



Hi David,

I've noticed the same problem with GCC 4.1.3 (in Thumb mode for ARM).

When a simple test file is compiled with -S to get only the "pseudo" assembly form, the quality of the generated code is quite poor. I've seen an equivalent inquiry on the CodeSourcery mailing list, noting that even though the gcc 4.x series performs good optimization, simple loops are compiled poorly. But I'm quite surprised by the lack of replies to that post...

Ps: Happy new year everybody!



  We were using GCC 3.4.0 to generate Thumb code for an ARM processor;
switching to GCC 4.1.1 has improved our code size (we always use the -Os
switch), but has severely degraded execution speed.
 After further investigation, we isolated one of the problems in the
following example:
 Source code:
void foo(int *a)
{	int i;
	for (i = 0; i < 1000000; i++)
		a[0] += a[1];
}
The result with GCC 3.4.0 with -mthumb -Os was:
00000000 <foo>:
  0:	b500      	push	{lr}
  2:	6803      	ldr	r3, [r0, #0]
  4:	4a03      	ldr	r2, [pc, #12]	(14 <.text+0x14>)
  6:	6841      	ldr	r1, [r0, #4]
  8:	3a01      	sub	r2, #1
  a:	185b      	add	r3, r3, r1
  c:	2a00      	cmp	r2, #0
  e:	d1fb      	bne	8 <foo+0x8>
 10:	6003      	str	r3, [r0, #0]
 12:	bd00      	pop	{pc}
 14:	4240      	neg	r0, r0
 16:	000f      	lsl	r7, r1, #0
 When compiled for ARM with GCC 4.1.1 (and mainline too) with -mthumb
-O1, we get:
00000000 <foo>:
  0:	b510      	push	{r4, lr}
  2:	1c04      	adds	r4, r0, #0
  4:	2200      	movs	r2, #0
  6:	6841      	ldr	r1, [r0, #4]
  8:	4803      	ldr	r0, [pc, #12]	(18 <.text+0x18>)
  a:	6823      	ldr	r3, [r4, #0]
  c:	185b      	adds	r3, r3, r1
  e:	3201      	adds	r2, #1
 10:	4282      	cmp	r2, r0
 12:	d1fb      	bne.n	c <foo+0xc>
 14:	6023      	str	r3, [r4, #0]
 16:	bd10      	pop	{r4, pc}
 18:	4240      	negs	r0, r0
 1a:	000f      	lsls	r7, r1, #0
-> Not so bad, but slower than 3.4.0

   when compiled with -mthumb -Os, we get:
00000000 <foo>:
  0:	b510      	push	{r4, lr}
  2:	6802      	ldr	r2, [r0, #0]
  4:	6844      	ldr	r4, [r0, #4]
  6:	2100      	movs	r1, #0
  8:	4b03      	ldr	r3, [pc, #12]	(18 <.text+0x18>)
  a:	3101      	adds	r1, #1
  c:	1912      	adds	r2, r2, r4
  e:	4299      	cmp	r1, r3
 10:	d1fa      	bne.n	8 <foo+0x8>
 12:	6002      	str	r2, [r0, #0]
 14:	bd10      	pop	{r4, pc}
 16:	0000      	lsls	r0, r0, #0
 18:	4240      	negs	r0, r0
 1a:	000f      	lsls	r7, r1, #0
 -> The load of the loop end value is performed within the loop!

   when compiled with -mthumb -O3, we get:
00000000 <foo>:
  0:	b530      	push	{r4, r5, lr}
  2:	6802      	ldr	r2, [r0, #0]
  4:	4d05      	ldr	r5, [pc, #20]	(1c <.text+0x1c>)
  6:	1d04      	adds	r4, r0, #4
  8:	2100      	movs	r1, #0
  a:	6823      	ldr	r3, [r4, #0]
  c:	3101      	adds	r1, #1
  e:	18d3      	adds	r3, r2, r3
 10:	1c1a      	adds	r2, r3, #0
 12:	6003      	str	r3, [r0, #0]
 14:	42a9      	cmp	r1, r5
 16:	d1f8      	bne.n	a <foo+0xa>
 18:	bd30      	pop	{r4, r5, pc}
 1a:	0000      	lsls	r0, r0, #0
 1c:	4240      	negs	r0, r0
 1e:	000f      	lsls	r7, r1, #0
 -> Amazingly slow!

    Does anybody have a magic set of options to generate code as efficient
and small as 3.4.0 did?
 Thanks in advance for any hints on this problem.

