We have problems when trying to switch from GCC 4.8.3 to GCC 4.9.0 (arm-eabi thumb1 for the arm966e-s core) due to a significant code size increase (1-2%). I attach our toolchain build scripts and two small example programs, though I'm not fully convinced that these examples capture the whole issue. The examples show +4.81% and +2% growth, but they are somewhat constructed. At first I suspected it was related to Bug 59535, since the code size decreases significantly when compiling without LRA (-mno-lra), but that does not give the full picture: the code size is still ~1% bigger than with GCC 4.8.3. We also tried with and without LTO, but it does not solve our problem either. See attached example. Build with 'make' and compare with 'make compare'. Thanks and Best Regards, /Fredrik
Created attachment 32980 [details] Toolchain build script 4.8.3.
Created attachment 32981 [details] Toolchain build script 4.9.0
Created attachment 32982 [details] Makefile for examples.
Created attachment 32983 [details] Example source 1.
Created attachment 32984 [details] Example source 2.
We need preprocessed source for someone to actually try this. CC'ing Vlad as LRA maintainer / author.
(In reply to Ramana Radhakrishnan from comment #6)
> We need preprocessed source for someone to actually try this. CC'ing Vlad as
> LRA maintainer / author.

Thanks for reporting this. Sorry, I am busy with other things right now. I'll start to work on the PR in September.
Created attachment 33544 [details] CSiBE test results (size)

Attached are some tests with CSiBE v2.1.1 for arm-eabi. The results are quite scattered: sometimes GCC 4.8.3 produces smaller code than GCC 4.9.1, but on other files it is vice versa. /Fredrik
Created attachment 33866 [details] Simple patch to exclude use of ip

Simple patch that makes it possible to optionally exclude use of ip for thumb1 when optimizing for size.
Created attachment 33867 [details] Size benchmark gcc 4.8, gcc 4.9 and trunk.

Added an updated CSiBE benchmark for GCC 4.8, GCC 4.9 and trunk. It's obvious that excluding ip gives shorter code. There is also something on trunk that makes some projects become very large, which should perhaps be investigated.
> Added updated CSiBE benchmark for GCC 4.8, GCC 4.9 and trunk.
> It's obvious that excluding ip gives shorted code.
> Then there is something on trunk that makes some project become very large,
> which should be investigated perhaps.

When compiling CSiBE with trunk, please add option "-std=gnu89".
(In reply to Fredrik Hederstierna from comment #9)
> Created attachment 33866 [details]
> Simple patch to exclude use of ip
>
> Simple patch that make it possible to optionally exclude use of ip for
> thumb1 when optimizing for size.

Please follow https://gcc.gnu.org/contribute.html with respect to contributing patches.
Confirmed.
Created attachment 34916 [details] CSiBE benchmark with gnu89, updated with newer trunk as reference.

I added an attachment with a new CSiBE measurement from a newer trunk, now using -std=gnu89 for correctness. It looks a lot better on current trunk; the code size is now smaller than with 4.8.x, so this issue seems at least partly resolved. Still, the proposed "-mip-fixed" patch on trunk gives approx -0.2% code size on average, which might be significant. See attached docs. The CSiBE test also indicates that LRA might have improved specific areas where the code size gets worse with IP fixed, which could be investigated further. An example file is libmspack/test/cabd_md5.c. So, I'm just wondering whether those of you who are or have been involved with this issue think the proposed patch is a good idea and worth the time to make it proper for commit? I don't want to put time and effort into this patch if it's not likely to get in, or if you think it's a bad idea. Please comment :) Thanks and Kind Regards Fredrik
I'm going to remove the regression marker on this as this is now just purely further code size improvements. Please submit patches on gcc-patches@gcc.gnu.org for further discussion. Patches attached to bug reports aren't reviewed.
GCC 4.9.3 has been released.
Created attachment 36201 [details] Simple example giving +50% code size increase comparing gcc-4.8.5 and gcc-5.2.0

Simple example giving +50% code size increase comparing gcc-4.8.5 and gcc-5.2.0. We still cannot use GCC 4.9, GCC 5.1 or GCC 5.2 due to the code size increase. GCC 4.8.x produces the smallest binaries for us, and we still haven't figured out exactly why. Attached is an attempt to create a test case where GCC 4.9 and later create bad code. Please check it out. With GCC-5.2.0 we got a 55% code size increase compared to GCC-4.8.5. In parallel I'm working with the CSiBE benchmark and I hope I can contribute some more metrics soon. Thanks and Best Regards/Fredrik
Created attachment 36202 [details] Disasm for -mthumb; here the code size increase was +48%.
I'm not sure why bug 59535 was closed; the same problem might still exist. Quote:

> Zhenqiang Chen 2014-09-03 06:17:44 UTC
>
> Here is a small case to show lra introduces one more register copy (tested with trunk and 4.9).

int isascii (int c)
{
  return c >= 0 && c < 128;
}

With options: -Os -mthumb -mcpu=cortex-m0, I got

isascii:
	mov	r3, #0
	mov	r2, #127
	mov	r1, r3	//???
	cmp	r2, r0
	adc	r1, r1, r3
	mov	r0, r1
	bx	lr

With options: -Os -mthumb -mcpu=cortex-m0 -mno-lra, I got

isascii:
	mov	r2, #127
	mov	r3, #0
	cmp	r2, r0
	adc	r3, r3, r3
	mov	r0, r3
	bx	lr

Testing 4.8.5 and 5.2.0: GCC 5.2 still produces the same bigger code. So something adds a register copy in this small case. /Fredrik
(In reply to Fredrik Hederstierna from comment #19)
> I'm not sure why bug 59535 was closed; the same problem might still exist. Quote:
>
> > Zhenqiang Chen 2014-09-03 06:17:44 UTC
> >
> > Here is a small case to show lra introduces one more register copy (tested with trunk and 4.9).
>
> int isascii (int c)
> {
>   return c >= 0 && c < 128;
> }
>
> With options: -Os -mthumb -mcpu=cortex-m0, I got
>
> isascii:
> 	mov	r3, #0
> 	mov	r2, #127
> 	mov	r1, r3	//???
> 	cmp	r2, r0
> 	adc	r1, r1, r3
> 	mov	r0, r1
> 	bx	lr
>
> With options: -Os -mthumb -mcpu=cortex-m0 -mno-lra, I got
>
> isascii:
> 	mov	r2, #127
> 	mov	r3, #0
> 	cmp	r2, r0
> 	adc	r3, r3, r3
> 	mov	r0, r3
> 	bx	lr
>
> Testing 4.8.5 and 5.2.0: GCC 5.2 still produces the same bigger code.
> So something adds a register copy in this small case.

Thanks for providing a small test case. I've started to work on the PR. I hope to provide a fix for the case today.
Great, thanks! Just a note as you are looking into this: neither GCC 4.8.x nor GCC 5.2.x produces optimal code for this case, I think. Isn't it better to load the result into register r0 directly, instead of going via r3?

GCC 4.8.5 (good version):

isascii:
	mov	r2, #127
	mov	r3, #0
	cmp	r2, r0
	adc	r3, r3, r3
	mov	r0, r3
	bx	lr

Better??:

isascii:
	mov	r2, #127
	mov	r3, #0
	cmp	r2, r0
+	adc	r0, r3, r3	<----- put result in r0 directly?
-	adc	r3, r3, r3
-	mov	r0, r3
	bx	lr

This would save one more instruction if I'm thinking right. BR, Fredrik
Author: vmakarov
Date: Tue Sep 1 19:37:52 2015
New Revision: 227382

URL: https://gcc.gnu.org/viewcvs?rev=227382&root=gcc&view=rev
Log:
2015-09-01  Vladimir Makarov  <vmakarov@redhat.com>

	PR target/61578
	* lra-lives.c (process_bb_lives): Process move pseudos with the
	same value for copies and preferences.
	* lra-constraints.c (match_reload): Create match reload pseudo
	with the same value from single dying input pseudo.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/lra-constraints.c
	trunk/gcc/lra-lives.c
Thanks for your patch, I tried it out, and it solves the small example fine; the code is now similar to GCC 4.8 for this particular example.

Though when I ran the full CSiBE benchmark, the code size unfortunately grew by approx +150 bytes overall for the full suite. So the patch did not solve the generic root problem with the code size increase.

This is strange and I'm thinking of how to continue from here. This issue has diverged a bit too much (mostly my own fault) with several examples etc. Do you think we should create separate issues for the different small examples that compile badly? At the same time we need to keep track of the 'generic' overall code size issue as e.g. CSiBE points out.

Here is another small example I tested yesterday that also gives unnecessary moves, tested on arm7tdmi, arm966e-s and cortex-m0:

extern void func(int data);
char cdata[4];
void test(void)
{
  int *idata = (int*)cdata;
  func(*idata);
}

Compiled with GCC 4.8.5 (cortex-m0):

00000000 <test>:
   0:	b508      	push	{r3, lr}
   2:	4b07      	ldr	r3, [pc, #28]	; (20 <test+0x20>)
   4:	7858      	ldrb	r0, [r3, #1]
   6:	781a      	ldrb	r2, [r3, #0]
   8:	0200      	lsls	r0, r0, #8
   a:	4310      	orrs	r0, r2
   c:	789a      	ldrb	r2, [r3, #2]
   e:	78db      	ldrb	r3, [r3, #3]
  10:	0412      	lsls	r2, r2, #16
  12:	4310      	orrs	r0, r2
  14:	061b      	lsls	r3, r3, #24
  16:	4318      	orrs	r0, r3
  18:	f7ff fffe 	bl	0 <func>
  1c:	bd08      	pop	{r3, pc}
  1e:	46c0      	nop			; (mov r8, r8)
  20:	00000000 	.word	0x00000000

With GCC 6 master with the latest LRA patch (+4 bytes):

00000000 <test>:
   0:	b510      	push	{r4, lr}
   2:	4c08      	ldr	r4, [pc, #32]	; (24 <test+0x24>)
   4:	7863      	ldrb	r3, [r4, #1]
   6:	7821      	ldrb	r1, [r4, #0]
   8:	78a0      	ldrb	r0, [r4, #2]
   a:	021b      	lsls	r3, r3, #8
   c:	430b      	orrs	r3, r1
   e:	0400      	lsls	r0, r0, #16
  10:	001a      	movs	r2, r3	??? MOVE
  12:	0003      	movs	r3, r0	??? MOVE
  14:	78e0      	ldrb	r0, [r4, #3]
  16:	4313      	orrs	r3, r2
  18:	0600      	lsls	r0, r0, #24
  1a:	4318      	orrs	r0, r3
  1c:	f7ff fffe 	bl	0 <func>
  20:	bd10      	pop	{r4, pc}
  22:	46c0      	nop			; (mov r8, r8)
  24:	00000000 	.word	0x00000000

Kind Regards, Fredrik
(In reply to Fredrik Hederstierna from comment #23)
> Thanks for your patch, I tried it out, and it solves the small example fine,
> the code now is similar to GCC 4.8 for this particular example.
>
> Though when I ran the full CSiBE benchmark, the code size unfortunately grew
> approx +150 bytes overall for the full suite. So the patch did not solve the
> generic root problem with code size increase unfortunately.
>
> This is strange and I'm thinking of how to continue from here, this issue
> has diverged a bit too much (mostly because of my own fault) with several
> examples etc. Do you think we should create separate issues for different
> small examples that compiles bad perhaps? but on the same time we need to
> keep track on the 'generic' overall code size issue as eg. CSiBE points out.

Either way will work for me. The most important thing is to have test cases somewhere in Bugzilla on which I can work.

I did not expect that the patch would solve the CSiBE code size degradation. Therefore I wrote "a patch for" instead of "a patch solving the problem". I expect it will be a long job to solve the problem.

> Here's is another small example I tested yesterday that also gives
> unnecessary moves, both for arm7tdmi, arm966e-s and cortex-m0 tested.

Thanks. I will investigate it.
I've put this last example in a separate issue: Bug 67507 - Code size increase with -Os from GCC 4.8.x to GCC 4.9.x for ARM thumb1. I've also previously filed one that causes a size increase: Bug 67213 - When compiling for size with -Os loops can get bigger after peeling. Best Regards, Fredrik
(In reply to Fredrik Hederstierna from comment #23)
> Here's is another small example I tested yesterday that also gives
> unnecessary moves, both for arm7tdmi, arm966e-s and cortex-m0 tested.
>
> extern void func(int data);
> char cdata[4];
> void test(void)
> {
>   int *idata = (int*)cdata;
>   func(*idata);
> }
> [...]

I found the problem root. We have

insn 9:  p115 = p114 | p112
...
insn 12: p118 = p117 | p115
...

IRA assigns different regs to p112, p115, and p118:

...
Popping a0(r121,l0)  --  assign reg 0
Popping a1(r118,l0)  --  assign reg 3
Popping a5(r115,l0)  --  assign reg 2
Popping a8(r112,l0)  --  assign reg 1
...

Therefore LRA generates a redundant insn 22 for insn 9 and insn 23 for insn 12, as the input and output operands should be in the same register.

There are no conflicts preventing the same hard reg from being assigned to p112, p115, and p118, but IRA does not do this, following heuristics that take other conflicting pseudos' costs into account. So the solution is to change the heuristics somehow. Even if I manage to do this, the changes should be benchmarked thoroughly on other architectures. It means the PR will need a lot of time to be fixed, but I am going to work on it.
The patch in comment 22 that was applied on 2015-09-01 with the git commit id 0af99ebfea26293fc900fe9050c5dd514005e4e5 breaks the S/390 compiler (and maybe other platforms too):

tt.c:
-- snip --
struct t
{
  int i;
  void *p;
  void *q;
};

void
foo (t **g)
{
  (*g)->p = &((*g)->p);
}
-- snip --

This code should set "p" to "&p", but actually sets "q" to "&p".

Before the patch, compiling with "g++ -O2" resulted in this assembly code:

tt.s:
	lg	%r1,0(%r2)	# r1 := *g
	la	%r2,8(%r1)	# r2 := &((*g)->p)
	stg	%r2,8(%r1)	# *r2 := (*g) + 8   <--- GOOD

with the patch it uses r1 for both, effectively doubling the store target's offset into the structure:

tt.s:
	lg	%r1,0(%r2)	# r1 := *g
	la	%r1,8(%r1)	# r1 := &((*g)->p)
	stg	%r1,8(%r1)	# *(r1 + 8) := (*g) + 8   <--- BAD

RTL details:
============

tt.c.232r.reload (BAD, with patch):
-- snip --
(insn 6 3 7 2 (set (reg/f:DI 1 %r1 [orig:61 *g_2(D) ] [61])
        (mem/f:DI (reg:DI 2 %r2 [ g ]) [1 *g_2(D)+0 S8 A64])) tt.c:9 900 {*movdi_64}
     (nil))
(insn 7 6 13 2 (parallel [
            (set (reg/f:DI 1 %r1 [orig:61 *g_2(D) ] [61])
                (plus:DI (reg/f:DI 1 %r1 [orig:61 *g_2(D) ] [61])
                    (const_int 8 [0x8])))
            (clobber (reg:CC 33 %cc))
        ]) tt.c:9 1171 {*adddi3}
     (nil))
(insn 13 7 10 2 (set (mem/f:DI (plus:DI (reg/f:DI 1 %r1 [orig:61 *g_2(D) ] [61])
                (const_int 8 [0x8])) [2 _3->p+0 S8 A64])
        (reg/f:DI 1 %r1 [orig:61 *g_2(D) ] [61])) tt.c:9 900 {*movdi_64}
     (nil))
-- snip --

tt.c.232r.reload (GOOD, without patch):
-- snip --
(insn 6 3 12 2 (set (reg/f:DI 1 %r1 [orig:61 *g_2(D) ] [61])
        (mem/f:DI (reg:DI 2 %r2 [ g ]) [1 *g_2(D)+0 S8 A64])) tt.c:9 900 {*movdi_64}
     (nil))
(insn 12 6 7 2 (set (reg:DI 2 %r2 [63])
        (reg/f:DI 1 %r1 [orig:61 *g_2(D) ] [61])) tt.c:9 900 {*movdi_64}
     (nil))
(insn 7 12 13 2 (parallel [
            (set (reg:DI 2 %r2 [63])
                (plus:DI (reg:DI 2 %r2 [63])
                    (const_int 8 [0x8])))
            (clobber (reg:CC 33 %cc))
        ]) tt.c:9 1171 {*adddi3}
     (nil))
(insn 13 7 10 2 (set (mem/f:DI (plus:DI (reg/f:DI 1 %r1 [orig:61 *g_2(D) ] [61])
                (const_int 8 [0x8])) [2 _3->p+0 S8 A64])
        (reg:DI 2 %r2 [63])) tt.c:9 900 {*movdi_64}
     (nil))
-- snip --

tt.c.231r.ira (GOOD):
-- snip --
(insn 6 3 7 2 (set (reg/f:DI 61 [ *g_2(D) ])
        (mem/f:DI (reg:DI 2 %r2 [ g ]) [1 *g_2(D)+0 S8 A64])) tt.c:9 900 {*movdi_64}
     (expr_list:REG_DEAD (reg:DI 2 %r2 [ g ])
        (nil)))
(insn 7 6 10 2 (parallel [
            (set (mem/f:DI (plus:DI (reg/f:DI 61 [ *g_2(D) ])
                        (const_int 8 [0x8])) [2 _3->p+0 S8 A64])
                (plus:DI (reg/f:DI 61 [ *g_2(D) ])
                    (const_int 8 [0x8])))
            (clobber (reg:CC 33 %cc))
        ]) tt.c:9 1171 {*adddi3}
     (expr_list:REG_DEAD (reg/f:DI 61 [ *g_2(D) ])
        (expr_list:REG_UNUSED (reg:CC 33 %cc)
            (nil))))
-- snip --
Created attachment 36371 [details] Output of the reload pass (BAD)

The full output of the reload pass on S/390, showing the behaviour described in comment 27.
I think I understand what's going on. Consider the patched code in match_reloads():

+    = (ins[1] < 0 && REG_P (in_rtx)
+       && (int) REGNO (in_rtx) < lra_new_regno_start
+       && find_regno_note (curr_insn, REG_DEAD, REGNO (in_rtx))
+       ? lra_create_new_reg (inmode, in_rtx, goal_class, "")
+       : lra_create_new_reg_with_unique_value (outmode, out_rtx,
+                                               goal_class, ""));

(1) This code normally makes a unique copy of the register in in_rtx, but if the register is marked as REG_DEAD in curr_insn, it just makes a copy of the register using lra_create_new_reg(), with the same .val and .offset in the reg_info structure.

(2) Further down in match_reload(), new insns are generated and stored in *before and *after. However, the new "after" insn still references the old register. In other words, in step (1) the code has made the assumption that the old register is no longer used, then generates an insn that uses it after it was marked as REG_DEAD.

(3) Based on the bogus decision in (1), the condition in lra-lives.c decides that the two registers are identical copies and can be mapped to the same hard register:

+	      && (((src_regno >= FIRST_PSEUDO_REGISTER
+		    && (! sparseset_bit_p (pseudos_live, src_regno)
+			|| (dst_regno >= FIRST_PSEUDO_REGISTER
+			    && lra_reg_val_equal_p (src_regno,
+						    lra_reg_info[dst_regno].val,
+						    lra_reg_info[dst_regno].offset))
(In reply to Dominik Vogt from comment #29)
> I think I understand what's going on:
>
> Consider the patched code in match_reloads():
> [...]
> (3) Based on the bogus decision in (1), the condition in lra-lives.c decides
> that the two registers are identical copies and can be mapped to the same
> hard register

Sorry for breaking the s390 port, and thanks for the analysis. I'll try to fix it this week.
Author: vmakarov
Date: Thu Sep 24 20:40:30 2015
New Revision: 228097

URL: https://gcc.gnu.org/viewcvs?rev=228097&root=gcc&view=rev
Log:
2015-09-24  Vladimir Makarov  <vmakarov@redhat.com>

	PR target/61578
	* ira-color.c (update_allocno_cost): Add parameter.
	(update_costs_from_allocno): Decrease conflict cost.  Pass the new
	parameter.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/ira-color.c
Update: We've stepped through the code with gdb and noticed that the change in lra-lives.c has nothing to do with our observations. The new condition there is never executed, so it must be just the change in match_reload().
Author: vmakarov
Date: Fri Sep 25 21:06:08 2015
New Revision: 228153

URL: https://gcc.gnu.org/viewcvs?rev=228153&root=gcc&view=rev
Log:
2015-09-25  Vladimir Makarov  <vmakarov@redhat.com>

	PR target/61578
	* lra-constarints.c (match_reload): Check presence of the input
	pseudo in the output pseudo.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/lra-constraints.c
The patch in comment 33 fixes the s390 regression. Thanks!
Looks like the extra condition in that patch is still not good enough:

--- a/gcc/lra-constraints.c
+++ b/gcc/lra-constraints.c
@@ -945,6 +945,12 @@ match_reload (signed char out, signed char *ins, enum reg_c
     = (ins[1] < 0 && REG_P (in_rtx)
        && (int) REGNO (in_rtx) < lra_new_regno_start
        && find_regno_note (curr_insn, REG_DEAD, REGNO (in_rtx))
+       /* We can not use the same value if the pseudo is mentioned
+	  in the output, e.g. as an address part in memory,
+	  becuase output reload will actually extend the pseudo
+	  liveness.  We don't care about eliminable hard regs here
+	  as we are interesting only in pseudos.  */
+       && (out < 0 || regno_use_in (REGNO (in_rtx), out_rtx) == NULL_RTX)
        ? lra_create_new_reg (inmode, in_rtx, goal_class, "")
        : lra_create_new_reg_with_unique_value (outmode, out_rtx,
					       goal_class, ""));
(Sorry, comment 35 belongs to the follow-up report https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70025 )
I've completely lost track of this bug - is this still open on gcc 4.9 / 5 and 6 or just relevant to 4.9 ?
I guess this 'meta bug report' has become slightly fuzzy with all kinds of CSiBE code size increase issues, so maybe it's better to identify these issues at a more detailed level and create smaller specific reports? I've made some attempts, like posting Bug 67507 and Bug 67213. I think the source attached to this issue also has a switch-case problem and still becomes larger, though it's better to post that in a separate issue as well.

I did a new benchmark yesterday on code size for

arm9_thumb
arm9_arm
cortex-m0
cortex-m3

using newly built compiler toolchains for

gcc 4.6.4
gcc 4.7.4
gcc 4.8.5
gcc 4.9.3
gcc 5.3.0
gcc 6-20160313 snapshot

in total 4*6=24 builds. See the attached Excel file for results. You can also check out my pre-alpha site at

http://gcc.hederstierna.com/csibe/

(my hope is to be able to build an up-to-date arm-size-benchmark, but it's still very pre-alpha).

The results look good I think; the overall size is getting smaller. Though some files miscompile into large code, and we need to dig into those specific files.

There is also a proposed patch attached to this issue regarding the ARM register IP, which could be used to further analyze why LRA code might increase in specific cases.

Do you still think the ip-patch is valid? I can rebase it against git master and submit the patch again in a separate issue.

Best Regards, Fredrik
Created attachment 38036 [details] CSiBE results for arm thumb1 and thumb2 code generation for various toolchains.
(In reply to Fredrik Hederstierna from comment #38)
> I guess this 'meta-bugreport' have become lightly fuzzy with all kinds of
> CSiBE code size increase issues, so maybe better to identify these issues
> on a more detailed level and create smaller specific reports?

It is much better to create smaller specific reports, otherwise IMNSHO it becomes impossible to track what's going on with these long running bug reports. Conflating multiple issues into one bug report is just going to cause confusion in the long run. Thus I'd like to close this bug; please open new bug reports for each individual issue that still remains open.

> I've done some approaches, like try posting Bug 67507 and Bug 67213. I think
> also the attached source to this issue have some switch-case issue and still
> becomes larger. Though I think its better to post that also in a separate issue

Yes please, and thanks a lot for all the benchmarking and digging you have been doing with this.

> There are though a proposed patch also attached in this issue regarding arm
> register IP, that could be used to further analyze why LRA code might
> increase in specific cases I think.
>
> Do you still think the ip-patch is valid I can rebase it against git master
> and submit patch again in a separate issue?

The policy for patches is posting them at gcc-patches@gcc.gnu.org and reviewing them on the list there. The reason it has not been reviewed is because no one ever spotted it. However, I don't think the IP patch is an appropriate approach to the problem, but the data you have with respect to -ffixed-ip might help understand what's going on with the register allocator. Off the top of my head it sounds odd that we have this problem: in Thumb state IP is available for register allocation quite late in the "order" already, and the compiler should prefer the low registers as much as possible.
Ok, I will mark this as resolved if no one objects. I tried the ip-fixed patch again on master, but the gain was very small, so I do not think this is a way forward currently. I created a new separate issue from the example submitted here: Bug 70341 - Code size increase on ARM cortex-m3 for switch statements. Thanks/Fredrik
Fixed for 6.0
Author: avieira
Date: Mon Jun 13 09:58:34 2016
New Revision: 237369

URL: https://gcc.gnu.org/viewcvs?rev=237369&root=gcc&view=rev
Log:
Backport from Mainline
2015-09-01  Vladimir Makarov  <vmakarov@redhat.com>

	PR target/61578
	* lra-lives.c (process_bb_lives): Process move pseudos with the
	same value for copies and preferences.
	* lra-constraints.c (match_reload): Create match reload pseudo
	with the same value from single dying input pseudo.

Modified:
	branches/ARM/embedded-5-branch/gcc/ChangeLog.arm
	branches/ARM/embedded-5-branch/gcc/lra-constraints.c
	branches/ARM/embedded-5-branch/gcc/lra-lives.c
Author: avieira
Date: Mon Jun 13 10:03:30 2016
New Revision: 237371

URL: https://gcc.gnu.org/viewcvs?rev=237371&root=gcc&view=rev
Log:
Backport from Mainline
2015-09-25  Vladimir Makarov  <vmakarov@redhat.com>

	PR target/61578
	* lra-constarints.c (match_reload): Check presence of the input
	pseudo in the output pseudo.

Modified:
	branches/ARM/embedded-5-branch/gcc/ChangeLog.arm
	branches/ARM/embedded-5-branch/gcc/lra-constraints.c