Bug 61578 - [4.9 regression] Code size increase for ARM thumb compared to 4.8.x when compiling with -Os
Summary: [4.9 regression] Code size increase for ARM thumb compared to 4.8.x when comp...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.9.0
: P2 normal
Target Milestone: 6.0
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-06-21 10:32 UTC by Fredrik Hederstierna
Modified: 2016-06-13 10:04 UTC (History)
6 users (show)

See Also:
Host: linux
Target: arm-none-eabi
Build:
Known to work: 4.8.3
Known to fail: 4.9.0, 5.0
Last reconfirmed: 2015-01-15 00:00:00


Attachments
Toolchain build script 4.8.3. (822 bytes, application/x-shellscript)
2014-06-21 10:33 UTC, Fredrik Hederstierna
Details
Toolchain build script 4.9.0 (818 bytes, application/x-shellscript)
2014-06-21 10:33 UTC, Fredrik Hederstierna
Details
Makefile for examples. (831 bytes, text/plain)
2014-06-21 10:34 UTC, Fredrik Hederstierna
Details
Example source 1. (569 bytes, text/x-csrc)
2014-06-21 10:35 UTC, Fredrik Hederstierna
Details
Example source 2. (363 bytes, text/x-csrc)
2014-06-21 10:35 UTC, Fredrik Hederstierna
Details
CSiBE test results size (54.75 KB, application/zip)
2014-09-23 21:45 UTC, Fredrik Hederstierna
Details
Simple patch to exclude use of ip (1.41 KB, patch)
2014-11-02 20:35 UTC, Fredrik Hederstierna
Details | Diff
Size benchmark gcc 4.8, gcc 4.9 and trunk. (40.56 KB, application/vnd.ms-excel)
2014-11-02 20:38 UTC, Fredrik Hederstierna
Details
CSiBE benchmark with gnu89, updates with newer trunk as reference. (51.23 KB, application/vnd.ms-excel)
2015-03-02 10:02 UTC, Fredrik Hederstierna
Details
Simple example giving +50% code size increase compared gcc-4.8.5 and gcc-5.2.0 (2.36 KB, application/zip)
2015-08-18 14:29 UTC, Fredrik Hederstierna
Details
Disasm for -mthumb also, code size increase was +48%. (1.72 KB, application/zip)
2015-08-18 14:41 UTC, Fredrik Hederstierna
Details
Outpout of the reload pass (BAD) (2.01 KB, text/plain)
2015-09-22 17:41 UTC, Dominik Vogt
Details
CSiBE results for arm thumb1 and thumb2 code generation for various toolchains. (501.43 KB, application/zip)
2016-03-20 12:38 UTC, Fredrik Hederstierna
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Fredrik Hederstierna 2014-06-21 10:32:28 UTC
We have problems when trying to switch from GCC 4.8.3 to GCC 4.9.0 (arm-eabi thumb1 for arm966e-s core) due to significant code size increase (1-2%).

I attach our toolchain build scripts and two small example programs, though I'm not fully convinced that these examples fully catch the issue.
Although the examples show increase +4.81% and +2% growth, though the examples are slightly constructed.

First I suspected it was related to Bug 59535, since the code size decrease significantly when compiling without LRA (-mno-lra), though still it does not make the full picture, since still the codes size is ~1% bigger then GCC 4.8.3.

We also tried with and without LTO, but it does not solve our problem either.
See attached example.
Build with 'make' and compare with 'make compare'.

Thanks and Best Regards,
/Fredrik
Comment 1 Fredrik Hederstierna 2014-06-21 10:33:25 UTC
Created attachment 32980 [details]
Toolchain build script 4.8.3.
Comment 2 Fredrik Hederstierna 2014-06-21 10:33:59 UTC
Created attachment 32981 [details]
Toolchain build script 4.9.0
Comment 3 Fredrik Hederstierna 2014-06-21 10:34:54 UTC
Created attachment 32982 [details]
Makefile for examples.
Comment 4 Fredrik Hederstierna 2014-06-21 10:35:21 UTC
Created attachment 32983 [details]
Example source 1.
Comment 5 Fredrik Hederstierna 2014-06-21 10:35:48 UTC
Created attachment 32984 [details]
Example source 2.
Comment 6 Ramana Radhakrishnan 2014-07-24 14:42:46 UTC
We need preprocessed source for someone to actually try this. CC'ing Vlad as LRA maintainer / author.
Comment 7 Vladimir Makarov 2014-08-14 14:23:33 UTC
(In reply to Ramana Radhakrishnan from comment #6)
> We need preprocessed source for someone to actually try this. CC'ing Vlad as
> LRA maintainer / author.

Thanks for reporting this.  Sorry, I am busy with other things right now.  I'll start to work on the PR in September.
Comment 8 Fredrik Hederstierna 2014-09-23 21:45:06 UTC
Created attachment 33544 [details]
CSiBE test results size

Attached some tests with CSiBE v2.1.1 for arm-eabi.
It seems like the results are very scattered,
sometimes GCC 4.8.3 produces smaller code than GCC 4.9.1,
but on other files it seems to be vice versa.
/Fredrik
Comment 9 Fredrik Hederstierna 2014-11-02 20:35:18 UTC
Created attachment 33866 [details]
Simple patch to exclude use of ip

Simple patch that make it possible to optionally exclude use of ip for thumb1 when optimizing for size.
Comment 10 Fredrik Hederstierna 2014-11-02 20:38:15 UTC
Created attachment 33867 [details]
Size benchmark gcc 4.8, gcc 4.9 and trunk.

Added updated CSiBE benchmark for GCC 4.8, GCC 4.9 and trunk.
It's obvious that excluding ip gives shorted code.
Then there is something on trunk that makes some project become very large, which should be investigated perhaps.
Comment 11 Zhenqiang Chen 2014-11-03 01:33:44 UTC
> Added updated CSiBE benchmark for GCC 4.8, GCC 4.9 and trunk.
> It's obvious that excluding ip gives shorted code.
> Then there is something on trunk that makes some project become very large,
> which should be investigated perhaps.

When compiling CSiBE with trunk, please add option "-std=gnu89".
Comment 12 Ramana Radhakrishnan 2014-12-12 15:03:47 UTC
(In reply to Fredrik Hederstierna from comment #9)
> Created attachment 33866 [details]
> Simple patch to exclude use of ip
> 
> Simple patch that make it possible to optionally exclude use of ip for
> thumb1 when optimizing for size.


Please follow https://gcc.gnu.org/contribute.html with respect to contributing patches.
Comment 13 Ramana Radhakrishnan 2015-01-15 11:25:40 UTC
Confirmed.
Comment 14 Fredrik Hederstierna 2015-03-02 10:02:41 UTC
Created attachment 34916 [details]
CSiBE benchmark with gnu89, updates with newer trunk as reference.

I added attachment with new CSiBE measurement from newer trunk,
and now using -std=gnu89 for correctness.

It looks alot better on current trunk, the code size is now smaller than 4.8.x, so in this case this issue seems at least partly resolved.

Though, still the proposed patch with "-mip-fixed" on trunk,
still gets approx  -0.2%  reduced code size in average,
which might seem significant. See attached docs.

The CSiBE test also it indicates that LRA might improved specific areas,
where the code size gets worse with IP fixed, which could be investigated further. Example file are libmspack/test/cabd_md5.c.

So, I'm just wondering if you that are or have been involved with this issue,
thinks the proposed patch is a good idea and worth putting time to make it proper for commit? I just do not want to put time and effort in this patch if its not likely to get in, or you think its a bad idea.

Please comment :)  Thanks and Kind Regards Fredrik
Comment 15 Ramana Radhakrishnan 2015-03-13 12:15:29 UTC
I'm going to remove the regression marker on this as this is now just purely further code size improvements. Please submit patches on gcc-patches@gcc.gnu.org for further discussion. Patches attached to bug reports aren't reviewed.
Comment 16 Jakub Jelinek 2015-06-26 19:53:14 UTC
GCC 4.9.3 has been released.
Comment 17 Fredrik Hederstierna 2015-08-18 14:29:30 UTC
Created attachment 36201 [details]
Simple example giving +50% code size increase compared gcc-4.8.5 and gcc-5.2.0

Simple example giving +50% code size increase compared gcc-4.8.5 and gcc-5.2.0.

We still cannot use GCC 4.9, GCC 5.1 and GCC 5.2 due to code size increase.
The GCC 4.8.x produces the smallest binaries for us, and we still haven't figured out exactly why.

Attached is an attempt to create a test case where GCC 4.9 and after creates bad code. Please check it out.

With GCC-5.2.0 we got 55% code size increase compared to GCC-4.8.5.

In parallel I work with the CSiBE benchmark and I hope I can contribute with some more metrics soon.

Thanks and Best Regards/Fredrik
Comment 18 Fredrik Hederstierna 2015-08-18 14:41:14 UTC
Created attachment 36202 [details]
Disasm for -mthumb also, code size increase was +48%.
Comment 19 Fredrik Hederstierna 2015-08-19 10:56:41 UTC
I'm not sure why bug 59535 was closed, same problem might still exist, quote:

> Zhenqiang Chen 2014-09-03 06:17:44 UTC
> 
> Here is a small case to show lra introduces one more register copy (tested with trunk and 4.9).

int isascii (int c)
{
  return c >= 0 && c < 128;
}
With options: -Os -mthumb -mcpu=cortex-m0, I got

isascii:
	mov	r3, #0
	mov	r2, #127
	mov	r1, r3   //???
	cmp	r2, r0
	adc	r1, r1, r3
	mov	r0, r1
	bx	lr

With options: -Os -mthumb -mcpu=cortex-m0 -mno-lra, I got

isascii:
	mov	r2, #127
	mov	r3, #0
	cmp	r2, r0
	adc	r3, r3, r3
	mov	r0, r3
	bx	lr

Testing 4.8.5 and 5.2.0 still still produces same bigger code in GCC 5.2.
So something adds a register copy in this small case.
/Fredrik
Comment 20 Vladimir Makarov 2015-09-01 14:06:31 UTC
(In reply to Fredrik Hederstierna from comment #19)
> I'm not sure why bug 59535 was closed, same problem might still exist, quote:
> 
> > Zhenqiang Chen 2014-09-03 06:17:44 UTC
> > 
> > Here is a small case to show lra introduces one more register copy (tested with trunk and 4.9).
> 
> int isascii (int c)
> {
>   return c >= 0 && c < 128;
> }
> With options: -Os -mthumb -mcpu=cortex-m0, I got
> 
> isascii:
> 	mov	r3, #0
> 	mov	r2, #127
> 	mov	r1, r3   //???
> 	cmp	r2, r0
> 	adc	r1, r1, r3
> 	mov	r0, r1
> 	bx	lr
> 
> With options: -Os -mthumb -mcpu=cortex-m0 -mno-lra, I got
> 
> isascii:
> 	mov	r2, #127
> 	mov	r3, #0
> 	cmp	r2, r0
> 	adc	r3, r3, r3
> 	mov	r0, r3
> 	bx	lr
> 
> Testing 4.8.5 and 5.2.0 still still produces same bigger code in GCC 5.2.
> So something adds a register copy in this small case.

Thanks for providing a small test case.  I've started to work on the PR.  I hope to provide a fix for the case today.
Comment 21 Fredrik Hederstierna 2015-09-01 14:15:53 UTC
Great, thanks!

Just a note as you are looking into this,
neither GCC 4.8.x nor GCC 5.2.x produces the optimal code I think for this case,
isn't it better to load result register r0, instead of go over r3?

GCC 4.8.5 (good version):

isascii:
	mov	r2, #127
	mov	r3, #0
	cmp	r2, r0
	adc	r3, r3, r3
	mov	r0, r3
	bx	lr

better ??:

isascii:
	mov	r2, #127
	mov	r3, #0
	cmp	r2, r0
+	adc	r0, r3, r3   <----- put result in R0 directly?
-	adc	r3, r3, r3
-	mov	r0, r3
	bx	lr

This would save one more instruction if I'm thinking right.
BR, Fredrik
Comment 22 Vladimir Makarov 2015-09-01 19:38:24 UTC
Author: vmakarov
Date: Tue Sep  1 19:37:52 2015
New Revision: 227382

URL: https://gcc.gnu.org/viewcvs?rev=227382&root=gcc&view=rev
Log:
2015-09-01  Vladimir Makarov  <vmakarov@redhat.com>

	PR target/61578
	* lra-lives.c (process_bb_lives): Process move pseudos with the
	same value for copies and preferences
	* lra-constraints.c (match_reload): Create match reload pseudo
	with the same value from single dying input pseudo.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/lra-constraints.c
    trunk/gcc/lra-lives.c
Comment 23 Fredrik Hederstierna 2015-09-03 05:16:07 UTC
Thanks for your patch, I tried it out, and it solves the small example fine, the code now is similar to GCC 4.8 for this particular example.

Though when I ran the full CSiBE benchmark, the code size unfortunately grew approx +150 bytes overall for the full suite. So the patch did not solve the generic root problem with code size increase unfortunately.

This is strange and I'm thinking of how to continue from here, this issue has diverged a bit too much (mostly because of my own fault) with several examples etc. Do you think we should create separate issues for different small examples that compiles bad perhaps? but on the same time we need to keep track on the 'generic' overall code size issue as eg. CSiBE points out.

Here's is another small example I tested yesterday that also gives unnecessary moves, both for arm7tdmi, arm966e-s and cortex-m0 tested.

extern void func(int data);
char cdata[4];
void test(void) {
  int *idata = (int*)cdata;
  func(*idata);
}

Compiles with GCC 4.8.5 (cortex-m0):

00000000 <test>:
   0:   b508            push    {r3, lr}
   2:   4b07            ldr     r3, [pc, #28]   ; (20 <test+0x20>)
   4:   7858            ldrb    r0, [r3, #1]
   6:   781a            ldrb    r2, [r3, #0]
   8:   0200            lsls    r0, r0, #8
   a:   4310            orrs    r0, r2
   c:   789a            ldrb    r2, [r3, #2]
   e:   78db            ldrb    r3, [r3, #3]
  10:   0412            lsls    r2, r2, #16
  12:   4310            orrs    r0, r2
  14:   061b            lsls    r3, r3, #24
  16:   4318            orrs    r0, r3
  18:   f7ff fffe       bl      0 <func>
  1c:   bd08            pop     {r3, pc}
  1e:   46c0            nop                     ; (mov r8, r8)
  20:   00000000        .word   0x00000000

With GCC 6 master with latest LRA patch (+4 bytes):

00000000 <test>:
   0:   b510            push    {r4, lr}
   2:   4c08            ldr     r4, [pc, #32]   ; (24 <test+0x24>)
   4:   7863            ldrb    r3, [r4, #1]
   6:   7821            ldrb    r1, [r4, #0]
   8:   78a0            ldrb    r0, [r4, #2]
   a:   021b            lsls    r3, r3, #8
   c:   430b            orrs    r3, r1
   e:   0400            lsls    r0, r0, #16
  10:   001a            movs    r2, r3   ??? MOVE
  12:   0003            movs    r3, r0   ??? MOVE
  14:   78e0            ldrb    r0, [r4, #3]
  16:   4313            orrs    r3, r2
  18:   0600            lsls    r0, r0, #24
  1a:   4318            orrs    r0, r3
  1c:   f7ff fffe       bl      0 <func>
  20:   bd10            pop     {r4, pc}
  22:   46c0            nop                     ; (mov r8, r8)
  24:   00000000        .word   0x00000000

Kind Regards, Fredrik
Comment 24 Vladimir Makarov 2015-09-03 14:59:04 UTC
(In reply to Fredrik Hederstierna from comment #23)
> Thanks for your patch, I tried it out, and it solves the small example fine,
> the code now is similar to GCC 4.8 for this particular example.
> 
> Though when I ran the full CSiBE benchmark, the code size unfortunately grew
> approx +150 bytes overall for the full suite. So the patch did not solve the
> generic root problem with code size increase unfortunately.
> 
> This is strange and I'm thinking of how to continue from here, this issue
> has diverged a bit too much (mostly because of my own fault) with several
> examples etc. Do you think we should create separate issues for different
> small examples that compiles bad perhaps? but on the same time we need to
> keep track on the 'generic' overall code size issue as eg. CSiBE points out.
> 

Either way will work for me.  The most important is to have test cases somewhere in Bigzilla on which I could work.

I did not expect that the patch solves CSiBE code size degradation.  Therefore I wrote "a patch for" instead of "patch solving the problem".

I expect it will be a long work to solve the problem.

> Here's is another small example I tested yesterday that also gives
> unnecessary moves, both for arm7tdmi, arm966e-s and cortex-m0 tested.
> 

Thanks.  I'll will investigate it.
Comment 25 Fredrik Hederstierna 2015-09-08 21:37:58 UTC
I've but this last example in a separate issue:
Bug 67507 - Code size increase with -Os from GCC 4.8.x to GCC 4.9.x for ARM thumb1.
I've also previously put this one that causes size increase
Bug 67213 - When compiling for size with -Os loops can get bigger after peeling.
Best Regards, Fredrik
Comment 26 Vladimir Makarov 2015-09-10 20:18:09 UTC
(In reply to Fredrik Hederstierna from comment #23)
> 
> Here's is another small example I tested yesterday that also gives
> unnecessary moves, both for arm7tdmi, arm966e-s and cortex-m0 tested.
> 
> extern void func(int data);
> char cdata[4];
> void test(void) {
>   int *idata = (int*)cdata;
>   func(*idata);
> }
> 
> Compiles with GCC 4.8.5 (cortex-m0):
> 
> 00000000 <test>:
>    0:   b508            push    {r3, lr}
>    2:   4b07            ldr     r3, [pc, #28]   ; (20 <test+0x20>)
>    4:   7858            ldrb    r0, [r3, #1]
>    6:   781a            ldrb    r2, [r3, #0]
>    8:   0200            lsls    r0, r0, #8
>    a:   4310            orrs    r0, r2
>    c:   789a            ldrb    r2, [r3, #2]
>    e:   78db            ldrb    r3, [r3, #3]
>   10:   0412            lsls    r2, r2, #16
>   12:   4310            orrs    r0, r2
>   14:   061b            lsls    r3, r3, #24
>   16:   4318            orrs    r0, r3
>   18:   f7ff fffe       bl      0 <func>
>   1c:   bd08            pop     {r3, pc}
>   1e:   46c0            nop                     ; (mov r8, r8)
>   20:   00000000        .word   0x00000000
> 
> With GCC 6 master with latest LRA patch (+4 bytes):
> 
> 00000000 <test>:
>    0:   b510            push    {r4, lr}
>    2:   4c08            ldr     r4, [pc, #32]   ; (24 <test+0x24>)
>    4:   7863            ldrb    r3, [r4, #1]
>    6:   7821            ldrb    r1, [r4, #0]
>    8:   78a0            ldrb    r0, [r4, #2]
>    a:   021b            lsls    r3, r3, #8
>    c:   430b            orrs    r3, r1
>    e:   0400            lsls    r0, r0, #16
>   10:   001a            movs    r2, r3   ??? MOVE
>   12:   0003            movs    r3, r0   ??? MOVE
>   14:   78e0            ldrb    r0, [r4, #3]
>   16:   4313            orrs    r3, r2
>   18:   0600            lsls    r0, r0, #24
>   1a:   4318            orrs    r0, r3
>   1c:   f7ff fffe       bl      0 <func>
>   20:   bd10            pop     {r4, pc}
>   22:   46c0            nop                     ; (mov r8, r8)
>   24:   00000000        .word   0x00000000
> 
> Kind Regards, Fredrik

I found the problem root.

We have

  insn 9: p115=p114|p112
  ...
  insn 12: p118=p117|p115
  ...

IRA assigns different regs to p112, p115, and p118

      ...
      Popping a0(r121,l0)  -- assign reg 0
      Popping a1(r118,l0)  -- assign reg 3
      Popping a5(r115,l0)  -- assign reg 2
      Popping a8(r112,l0)  -- assign reg 1
      ...

Therefore LRA generates redundant insn 22 for insn 9 and insn 23 for
insn 12 as an input and the output operands should be in the same
register.

There is no conflicts preventing to assign the same hard reg to p112,
p115, and p118 but IRA does not do this following heuristics taking
other conflict pseudos costs into account.

So the solution is to change the heuristics somehow.  Even if I manage
to do this, the changes should be benchmarked on other architectures
thorougly.  It means the PR will need a lot of time to be fixed but I
am going to work on it.
Comment 27 Dominik Vogt 2015-09-22 17:13:41 UTC
The patch in comment 22 that was applied on 2015-09-01 with the git commit id 0af99ebfea26293fc900fe9050c5dd514005e4e5 breaks the S/390 compiler (and maybe other platforms too):

tt.c:
-- snip --
  struct t
  {
    int i;
    void *p;
    void *q;
  };
  void foo (t **g)
  {
    (*g)->p = &((*g)->p);
  }
-- snip --

This code should set "p" to "&p", but actually sets "q" to "&p".  Before the patch, compiling with "g++ -O2" resulted in this assembly code:

tt.s:
  lg    %r1,0(%r2)  # r1 := *g
  la    %r2,8(%r1)  # r2 := &((*g)->p)
  stg   %r2,8(%r1)  # *r2 := (*g) + 8  <--- GOOD

with the patch it uses r1 for both, effectively doubling the store target's offset into the structure:

tt.s:
  lg      %r1,0(%r2) # r1 := *g
  la      %r1,8(%r1) # r1 := &((*g)->p)
  stg     %r1,8(%r1) # *(r1 + 8) := (*g) + 8  <--- BAD


RTL details:
============

tt.c.232r.reload (BAD, with patch):
-- snip --
(insn 6 3 7 2 (set (reg/f:DI 1 %r1 [orig:61 *g_2(D) ] [61])
        (mem/f:DI (reg:DI 2 %r2 [ g ]) [1 *g_2(D)+0 S8 A64])) tt.c:9 900 {*movdi_64}
     (nil))
(insn 7 6 13 2 (parallel [
            (set (reg/f:DI 1 %r1 [orig:61 *g_2(D) ] [61])
                (plus:DI (reg/f:DI 1 %r1 [orig:61 *g_2(D) ] [61])
                    (const_int 8 [0x8])))
            (clobber (reg:CC 33 %cc))
        ]) tt.c:9 1171 {*adddi3}
     (nil))
(insn 13 7 10 2 (set (mem/f:DI (plus:DI (reg/f:DI 1 %r1 [orig:61 *g_2(D) ] [61])
                (const_int 8 [0x8])) [2 _3->p+0 S8 A64])
        (reg/f:DI 1 %r1 [orig:61 *g_2(D) ] [61])) tt.c:9 900 {*movdi_64}
     (nil))
-- snip --


tt.c.232r.reload (GOOD, without patch):
-- snip --
(insn 6 3 12 2 (set (reg/f:DI 1 %r1 [orig:61 *g_2(D) ] [61])
        (mem/f:DI (reg:DI 2 %r2 [ g ]) [1 *g_2(D)+0 S8 A64])) tt.c:9 900 {*movdi_64}
     (nil))
(insn 12 6 7 2 (set (reg:DI 2 %r2 [63])
        (reg/f:DI 1 %r1 [orig:61 *g_2(D) ] [61])) tt.c:9 900 {*movdi_64}
     (nil))
(insn 7 12 13 2 (parallel [
            (set (reg:DI 2 %r2 [63])
                (plus:DI (reg:DI 2 %r2 [63])
                    (const_int 8 [0x8])))
            (clobber (reg:CC 33 %cc))
        ]) tt.c:9 1171 {*adddi3}
     (nil))
(insn 13 7 10 2 (set (mem/f:DI (plus:DI (reg/f:DI 1 %r1 [orig:61 *g_2(D) ] [61])
                (const_int 8 [0x8])) [2 _3->p+0 S8 A64])
        (reg:DI 2 %r2 [63])) tt.c:9 900 {*movdi_64}
     (nil))
-- snip --


tt.c.231r.ira (GOOD):
-- snip --
(insn 6 3 7 2 (set (reg/f:DI 61 [ *g_2(D) ])
        (mem/f:DI (reg:DI 2 %r2 [ g ]) [1 *g_2(D)+0 S8 A64])) tt.c:9 900 {*movdi_64}
     (expr_list:REG_DEAD (reg:DI 2 %r2 [ g ])
        (nil)))
(insn 7 6 10 2 (parallel [
            (set (mem/f:DI (plus:DI (reg/f:DI 61 [ *g_2(D) ])
                        (const_int 8 [0x8])) [2 _3->p+0 S8 A64])
                (plus:DI (reg/f:DI 61 [ *g_2(D) ])
                    (const_int 8 [0x8])))
            (clobber (reg:CC 33 %cc))
        ]) tt.c:9 1171 {*adddi3}
     (expr_list:REG_DEAD (reg/f:DI 61 [ *g_2(D) ])
        (expr_list:REG_UNUSED (reg:CC 33 %cc)
            (nil))))
-- snip --
Comment 28 Dominik Vogt 2015-09-22 17:41:11 UTC
Created attachment 36371 [details]
Outpout of the reload pass (BAD)

The full output of the reload pass on S/390, showing the behaviour described in comment 27.
Comment 29 Dominik Vogt 2015-09-23 12:16:38 UTC
I think I understand what's going on:

Consider the patched code in match_reloads():

+       = (ins[1] < 0 && REG_P (in_rtx)
+          && (int) REGNO (in_rtx) < lra_new_regno_start
+          && find_regno_note (curr_insn, REG_DEAD, REGNO (in_rtx))
+          ? lra_create_new_reg (inmode, in_rtx, goal_class, "")
+          : lra_create_new_reg_with_unique_value (outmode, out_rtx,
+                                                  goal_class, ""));

(1) This code normally makes a unique copy of the register in in_rtx, but if the register is marked as REG_DEAD in the curr_insn, it just makes a copy of the register using lra_create_new_reg(), with the same .val and .offset in the reg_info structure.

(2) Further down in match_reloads, new insns are generated and stored in *before and *after.  However, the new "after" insn still references the old register.  In other words, in step (1) the code has made the assumption that the old register is no longer used, then generates an insn that uses it after it was marked as REG_DEAD.

(3) Based on the bogus decision in (1), the condition in lra-lives.c decides that the two registers are identical copies and can be mapped to the same hard register:

+         && (((src_regno >= FIRST_PSEUDO_REGISTER
+               && (! sparseset_bit_p (pseudos_live, src_regno)
+                   || (dst_regno >= FIRST_PSEUDO_REGISTER
+                       && lra_reg_val_equal_p (src_regno,
+                                               lra_reg_info[dst_regno].val,
+                                               lra_reg_info[dst_regno].offset))
Comment 30 Vladimir Makarov 2015-09-24 20:38:45 UTC
(In reply to Dominik Vogt from comment #29)
> I think I understand what's going on:
> 
> Consider the patched code in match_reloads():
> 
> +       = (ins[1] < 0 && REG_P (in_rtx)
> +          && (int) REGNO (in_rtx) < lra_new_regno_start
> +          && find_regno_note (curr_insn, REG_DEAD, REGNO (in_rtx))
> +          ? lra_create_new_reg (inmode, in_rtx, goal_class, "")
> +          : lra_create_new_reg_with_unique_value (outmode, out_rtx,
> +                                                  goal_class, ""));
> 
> (1) This code normally makes a unique copy of the register in in_rtx, but if
> the register is marked as REG_DEAD in the curr_insn, it just makes a copy of
> the register using lra_create_new_reg(), with the same .val and .offset in
> the reg_info structure.
> 
> (2) Further down in match_reloads, new insns are generated and stored in
> *before and *after.  However, the new "after" insn still references the old
> register.  In other words, in step (1) the code has made the assumption that
> the old register is no longer used, then generates an insn that uses it
> after it was marked as REG_DEAD.
> 
> (3) Based on the bogus decision in (1), the condition in lra-lives.c decides
> that the two registers are identical copies and can be mapped to the same
> hard register:
> 
> +         && (((src_regno >= FIRST_PSEUDO_REGISTER
> +               && (! sparseset_bit_p (pseudos_live, src_regno)
> +                   || (dst_regno >= FIRST_PSEUDO_REGISTER
> +                       && lra_reg_val_equal_p (src_regno,
> +                                               lra_reg_info[dst_regno].val,
> +                                              
> lra_reg_info[dst_regno].offset))

Sorry for breaking s390 port and thanks for the analysis.  I'll try to fix it on this week.
Comment 31 Vladimir Makarov 2015-09-24 20:41:02 UTC
Author: vmakarov
Date: Thu Sep 24 20:40:30 2015
New Revision: 228097

URL: https://gcc.gnu.org/viewcvs?rev=228097&root=gcc&view=rev
Log:
2015-09-24  Vladimir Makarov  <vmakarov@redhat.com>

	PR target/61578
	* ira-color.c (update_allocno_cost): Add parameter.
	(update_costs_from_allocno): Decrease conflict cost.  Pass the new
	parameter.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ira-color.c
Comment 32 Dominik Vogt 2015-09-25 09:12:46 UTC
Update: We've stepped through the code with gdb and noticed that the change in lra-lives.c has nothing to do with our observations.  The new condition there is never executed, so it must be just the change in match_reload().
Comment 33 Vladimir Makarov 2015-09-25 21:06:40 UTC
Author: vmakarov
Date: Fri Sep 25 21:06:08 2015
New Revision: 228153

URL: https://gcc.gnu.org/viewcvs?rev=228153&root=gcc&view=rev
Log:
2015-09-25  Vladimir Makarov  <vmakarov@redhat.com>

	PR target/61578
	* lra-constarints.c (match_reload): Check presence of the input pseudo
	in the output pseudo.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/lra-constraints.c
Comment 34 Dominik Vogt 2015-09-28 09:12:26 UTC
The patch in comment 33 fixes the s390 regression.  Thanks!
Comment 35 Dominik Vogt 2016-03-01 10:52:39 UTC
Looks like the extra condition in that patch is still not good enough:

--- a/gcc/lra-constraints.c
+++ b/gcc/lra-constraints.c
@@ -945,6 +945,12 @@ match_reload (signed char out, signed char *ins, enum reg_c
       	= (ins[1] < 0 && REG_P (in_rtx)
           && (int) REGNO (in_rtx) < lra_new_regno_start
           && find_regno_note (curr_insn, REG_DEAD, REGNO (in_rtx))
+          /* We can not use the same value if the pseudo is mentioned
+             in the output, e.g. as an address part in memory,
+             becuase output reload will actually extend the pseudo
+             liveness.  We don't care about eliminable hard regs here
+             as we are interesting only in pseudos.  */
+          && (out < 0 || regno_use_in (REGNO (in_rtx), out_rtx) == NULL_RTX)
           ? lra_create_new_reg (inmode, in_rtx, goal_class, "")
           : lra_create_new_reg_with_unique_value (outmode, out_rtx,
                                                   goal_class, ""));
Comment 36 Dominik Vogt 2016-03-01 10:53:35 UTC
(Sorry, comment 35 belongs to the follow-up report https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70025 )
Comment 37 Ramana Radhakrishnan 2016-03-16 15:58:31 UTC
I've completely lost track of this bug - is this still open on gcc 4.9 / 5 and 6 or just relevant to 4.9 ?
Comment 38 Fredrik Hederstierna 2016-03-20 12:36:17 UTC
I guess this 'meta-bugreport' have become lightly fuzzy with all kinds of CSiBE code size increase issues,
so maybe better to identify these issues on a more detailed level and create smaller specific reports?

I've done some approaches, like try posting Bug 67507 and Bug 67213. I think also the attached source to this issue have some switch-case issue and still becomes larger.
Though I think its better to post that also in a separate issue.

I did a new benchmark yesterday on code size for

arm9_thumb
arm9_arm
cortex-m0
cortex-m3

using newly built compiler toolchains for

gcc 4.6.4
gcc 4.7.4
gcc 4.8.5
gcc 4.9.3
gcc 5.3.0
gcc 6-20160313 snapshot

in total 4*6=24 builds. See attached Excel file for results.
Also you can check out my pre-alpha site at

http://gcc.hederstierna.com/csibe/

(my hope is to be able to build a up-to-date arm-size-benchmark, but its very pre-alpha still.)

The results looks good I think. The overall size is getting smaller. So I think its ok.
Though some files miscompiles into large code, we need to dig into these and look at the specific files  I think.

There are though a proposed patch also attached in this issue regarding arm register IP,
that could be used to further analyze why LRA code might increase in specific cases I think.

Do you still think the ip-patch is valid I can rebase it against git master and submit patch again in a separate issue?

Best Regards, Fredrik
Comment 39 Fredrik Hederstierna 2016-03-20 12:38:15 UTC
Created attachment 38036 [details]
CSiBE results for arm thumb1 and thumb2 code generation for various toolchains.

CSiBE results for arm thumb1 and thumb2 code generation for various toolchains.
Comment 40 Ramana Radhakrishnan 2016-03-21 09:05:11 UTC
(In reply to Fredrik Hederstierna from comment #38)
> I guess this 'meta-bugreport' have become lightly fuzzy with all kinds of
> CSiBE code size increase issues,
> so maybe better to identify these issues on a more detailed level and create
> smaller specific reports?

It is much better to create smaller specific reports, otherwise IMNSHO it becomes impossible to track what's going on with these long running bug reports. 

Conflating multiple issues into one bug report is just going to cause confusion in the long run.

Thus I'd like to close this bug and if you could please open new bug reports for each individual issue that still remains open.

> 
> I've done some approaches, like try posting Bug 67507 and Bug 67213. I think
> also the attached source to this issue have some switch-case issue and still
> becomes larger.
> Though I think its better to post that also in a separate issue

Yes please and thanks a lot for all the benchmarking and digging you have been doing with this.

> 
> I did a new benchmark yesterday on code size for
> 
> arm9_thumb
> arm9_arm
> cortex-m0
> cortex-m3
> 
> using newly built compiler toolchains for
> 
> gcc 4.6.4
> gcc 4.7.4
> gcc 4.8.5
> gcc 4.9.3
> gcc 5.3.0
> gcc 6-20160313 snapshot
> 
> in total 4*6=24 builds. See attached Excel file for results.
> Also you can check out my pre-alpha site at
> 
> http://gcc.hederstierna.com/csibe/
> 
> (my hope is to be able to build a up-to-date arm-size-benchmark, but its
> very pre-alpha still.)
> 
> The results looks good I think. The overall size is getting smaller. So I
> think its ok.
> Though some files miscompiles into large code, we need to dig into these and
> look at the specific files  I think.
> 
> There are though a proposed patch also attached in this issue regarding arm
> register IP,
> that could be used to further analyze why LRA code might increase in
> specific cases I think.
> 
> Do you still think the ip-patch is valid I can rebase it against git master
> and submit patch again in a separate issue?

The policy for patches is posting them at gcc-patches@gcc.gnu.org and reviewing them on the list there. The reason it has not been reviewed is because no one ever spotted it.

However, I don't think the IP patch is an appropriate approach to the problem but the data you have with respect to -ffixed-ip might help understand what's going on with the register allocator. Off the top of my head it sounds odd that we have this problem - In Thumb state IP is available for register allocation quite late in the "order" already and the compiler should prefer the low registers as much as possible.


> 
> Best Regards, Fredrik
Comment 41 Fredrik Hederstierna 2016-03-21 17:02:33 UTC
Ok, I will mark this as resolved if noone object.

I tried the ip-fixed again on master, but the gain was very little, so I do not think this is any way forward currently.

I created this new separate issue from the example submitted in this issue:

Bug 70341 - Code size increase on ARM cortex-m3 for switch statements

Thanks/Fredrik
Comment 42 Ramana Radhakrishnan 2016-03-22 10:40:22 UTC
Fixed for 6.0
Comment 43 avieira 2016-06-13 09:59:06 UTC
Author: avieira
Date: Mon Jun 13 09:58:34 2016
New Revision: 237369

URL: https://gcc.gnu.org/viewcvs?rev=237369&root=gcc&view=rev
Log:
	Backport from Mainline
	2015-09-01  Vladimir Makarov  <vmakarov@redhat.com>

	PR target/61578
	* lra-lives.c (process_bb_lives): Process move pseudos with the
	  same value for copies and preferences
	* lra-constraints.c (match_reload): Create match reload pseudo
	  with the same value from single dying input pseudo.

Modified:
    branches/ARM/embedded-5-branch/gcc/ChangeLog.arm
    branches/ARM/embedded-5-branch/gcc/lra-constraints.c
    branches/ARM/embedded-5-branch/gcc/lra-lives.c
Comment 44 avieira 2016-06-13 10:04:01 UTC
Author: avieira
Date: Mon Jun 13 10:03:30 2016
New Revision: 237371

URL: https://gcc.gnu.org/viewcvs?rev=237371&root=gcc&view=rev
Log:
	Backport from Mainline
	2015-09-25  Vladimir Makarov  <vmakarov@redhat.com>

	PR target/61578
	* lra-constarints.c (match_reload): Check presence of the input pseudo
	  in the output pseudo.

Modified:
    branches/ARM/embedded-5-branch/gcc/ChangeLog.arm
    branches/ARM/embedded-5-branch/gcc/lra-constraints.c