| Summary: | [4.9 regression] -Os code size regressions for Thumb1/Thumb2 with LRA | | |
|---|---|---|---|
| Product: | gcc | Reporter: | Richard Earnshaw <rearnsha> |
| Component: | rtl-optimization | Assignee: | Vladimir Makarov <vmakarov> |
| Status: | RESOLVED FIXED | | |
| Severity: | normal | CC: | fredrik.hederstierna, jeffreyalaw, ramana.radhakrishnan, ramana, vmakarov |
| Priority: | P1 | Keywords: | missed-optimization, ra |
| Version: | 4.9.0 | | |
| Target Milestone: | 4.9.0 | | |
| Host: | | Target: | arm |
| Build: | | Known to work: | |
| Known to fail: | | Last reconfirmed: | 2013-12-19 00:00:00 |
| Attachments: | testcase, Another testcase | | |
Description  Richard Earnshaw  2013-12-17 09:30:40 UTC

CSiBE code size results for Thumb2:
  2013/12/09  2543786
  2013/12/11  2563522

CSiBE code size results for Thumb1:
  2013/12/09  2634640
  2013/12/11  2683980

That is about a 1.8% size regression. It seems that one of the major problems is a significant increase in the number of mov instructions used to copy registers from one location to another. This is probably due to the two-operand format instructions we have in Thumb.

Created attachment 31455 [details]
testcase
Compile with -Os -mthumb -mcpu=arm7tdmi -fno-short-enums and either -mlra or -mno-lra
Number of register-register move operations in the testcase:
  lra: 208
  no-lra: 105

Created attachment 31457 [details]
Another testcase
Another testcase, but this one has some obvious examples of poor behaviour for -Os.
In addition to the options used in the previous case, this might need
-fno-strict-aliasing -fno-common -fomit-frame-pointer -fno-strength-reduce
Example one: spilling a value to the stack while also keeping a copy in a hard register across a call.
mov r5, r1 <= R1 copied to R5
sub sp, sp, #28
str r1, [sp, #8] <= And spilled to the stack
mov r2, #12
mov r1, #0
mov r4, r0
bl memset
mov r3, #2
mov r2, r5 <= Could reload from the stack instead
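To make the scenario concrete, here is a minimal hypothetical C fragment of the shape that tends to trigger this (the struct, names and sizes are illustrative and not taken from the attached testcase): the second argument is live across the memset call, so the allocator must preserve it somehow, but ideally not both by copying it to a callee-saved register and by spilling it.

#include <string.h>

struct ctx { int vals[3]; };

/* Hypothetical reproducer: P must survive the call to memset.  Either a
   callee-saved register copy or a stack spill is enough on its own;
   generating both wastes an instruction at -Os.  */
int *
init_ctx (struct ctx *c, int *p)
{
  memset (c, 0, 12);
  return p + 2;
}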
Example two: use of multiple reloads to go via a high register:
ldr r3, [sp, #4]
mov ip, r3 <= Copying value into high register
add ip, ip, r5 <= Arithmetic
mov r3, ip <= Copying result back to original register
str r3, [sp, #4]
ldr r3, [sp, #12]
mov ip, r3 <= And IP is dead anyway...
In this case,
mov ip, r3
add ip, ip, r5
mov r3, ip
can be replaced entirely with
add r3, r5
saving two completely unnecessary MOV instructions.
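As a rough illustration (hypothetical source, not from the attachment), the kind of statement behind this pattern is an update of a stack-resident value by a register-resident one; at -Os it should compile to a plain ldr/add/str sequence through a single low register, with no detour through ip.

/* Hypothetical shape: TOTAL lives in a stack slot (forced here with
   volatile), STEP lives in a register; the increment should need only a
   load, an add, and a store, with no copies through a high register.  */
void
bump (int n, int step)
{
  volatile int total = 0;
  for (int i = 0; i < n; i++)
    total += step;
}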
A third, related case:
mov r1, #12
mov ip, r1
add ip, ip, r4
mov r1, ip
Could be done either as
mov r1, #12
add r1, r4
mov ip, r1
or
mov r1, r4
add r1, #12
mov ip, r1
both saving one instruction, or even two if the value doesn't really need copying to a high reg.
I'm not able to reproduce with today's trunk or rev #205887 built for arm-none-linux-gnueabi, as the compiler ICEs in assign_by_spills (lra-assign.c). What is your configuration?

Forget my previous comment, I can reproduce it after a rebase.

(In reply to Richard Earnshaw from comment #5)
> Number of register-register move operations in the testcase
> lra: 208
> no-lra: 105

That is a huge degradation, and I guess it is not only because of 2-operand insns; LRA works adequately on x86/x86-64, although I know of some issues with 2-operand commutative insns (I am trying to fix those too). I think the major problem is wrong alternative choices, as Thumb lo/hi register usage is complicated.

I take this bug very seriously. If I cannot fix it by the end of January (sorry, it is vacation time), we should probably switch back to the reload pass for Thumb.

./cc1 -Os -mthumb -mcpu=arm7tdmi -fno-short-enums test.i -m{no-}lra -fno-schedule-insns2

                         original reload   reload with change   lra with change
  reg moves              104               86                   101
  all moves              210               192                  207
  loads                  561               569                  541
  stores                 234               231                  210
  overall (mov/ldr/str)  1005              992                  958

First of all, I could not compile the test with LRA because of its crash. The problem is in choosing alternatives differently from reload. For some insn alternatives, LRA needs two hi regs for reloads, but there is only one available, r12; all the other hi regs are fixed. To be honest, I don't know why r12 is not fixed. It results in r12 being used by IRA and bigger code even when reload is used. I believe it should be fixed too.

Second, a big number of moves is not a bad thing in itself; it should be considered together with the number of loads/stores. LRA generates 15 more reg moves but 28 fewer loads and 21 fewer stores. This means LRA with the patch generates smaller and faster code than reload.

Here is the patch improving code size for reload, fixing the LRA crash, and improving code for LRA. If the arm part of the patch is OK, please let me know and I'll commit it. The rest of the patch does not affect the code generated for this test, but it might affect other code generated with -Os.

Index: config/arm/arm.c
===================================================================
--- config/arm/arm.c    (revision 206089)
+++ config/arm/arm.c    (working copy)
@@ -29236,6 +29236,7 @@ arm_conditional_register_usage (void)
       for (regno = FIRST_HI_REGNUM; regno <= LAST_HI_REGNUM; ++regno)
         fixed_regs[regno] = call_used_regs[regno] = 1;
+      fixed_regs[12] = call_used_regs[12] = 1;
     }
 
   /* The link register can be clobbered by any branch insn,
Index: lra-assigns.c
===================================================================
--- lra-assigns.c    (revision 206089)
+++ lra-assigns.c    (working copy)
@@ -612,7 +612,9 @@ find_hard_regno_for (int regno, int *cos
                && ! df_regs_ever_live_p (hard_regno + j))
              /* It needs save restore.  */
              hard_regno_costs[hard_regno]
-               += 2 * ENTRY_BLOCK_PTR_FOR_FN (cfun)->next_bb->frequency + 1;
+               += (2
+                   * REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun)->next_bb)
+                   + 1);
           priority = targetm.register_priority (hard_regno);
           if (best_hard_regno < 0 || hard_regno_costs[hard_regno] < best_cost
               || (hard_regno_costs[hard_regno] == best_cost
Index: lra-coalesce.c
===================================================================
--- lra-coalesce.c    (revision 206089)
+++ lra-coalesce.c    (working copy)
@@ -79,8 +79,8 @@ move_freq_compare_func (const void *v1p,
   rtx mv2 = *(const rtx *) v2p;
   int pri1, pri2;
 
-  pri1 = BLOCK_FOR_INSN (mv1)->frequency;
-  pri2 = BLOCK_FOR_INSN (mv2)->frequency;
+  pri1 = REG_FREQ_FROM_BB (BLOCK_FOR_INSN (mv1));
+  pri2 = REG_FREQ_FROM_BB (BLOCK_FOR_INSN (mv2));
   if (pri2 - pri1)
     return pri2 - pri1;
@@ -277,7 +277,7 @@ lra_coalesce (void)
                 fprintf (lra_dump_file, "      Coalescing move %i:r%d-r%d (freq=%d)\n",
                          INSN_UID (mv), sregno, dregno,
-                         BLOCK_FOR_INSN (mv)->frequency);
+                         REG_FREQ_FROM_BB (BLOCK_FOR_INSN (mv)));
               /* We updated involved_insns_bitmap when doing the merge.  */
             }
           else if (!(lra_intersected_live_ranges_p
@@ -291,7 +291,7 @@ lra_coalesce (void)
                  "  Coalescing move %i:r%d(%d)-r%d(%d) (freq=%d)\n",
                  INSN_UID (mv), sregno, ORIGINAL_REGNO (SET_SRC (set)),
                  dregno, ORIGINAL_REGNO (SET_DEST (set)),
-                 BLOCK_FOR_INSN (mv)->frequency);
+                 REG_FREQ_FROM_BB (BLOCK_FOR_INSN (mv)));
               bitmap_ior_into (&involved_insns_bitmap,
                                &lra_reg_info[sregno].insn_bitmap);
               bitmap_ior_into (&involved_insns_bitmap,
@@ -316,7 +316,8 @@ lra_coalesce (void)
           /* Coalesced move.  */
           if (lra_dump_file != NULL)
             fprintf (lra_dump_file, "      Removing move %i (freq=%d)\n",
-                     INSN_UID (insn), BLOCK_FOR_INSN (insn)->frequency);
+                     INSN_UID (insn),
+                     REG_FREQ_FROM_BB (BLOCK_FOR_INSN (insn)));
           lra_set_insn_deleted (insn);
         }
     }
Index: lra-constraints.c
===================================================================
--- lra-constraints.c    (revision 206089)
+++ lra-constraints.c    (working copy)
@@ -4077,7 +4077,7 @@ lra_constraints (bool first_p)
               fprintf (lra_dump_file,
                        "      Removing equiv init insn %i (freq=%d)\n",
                        INSN_UID (curr_insn),
-                       BLOCK_FOR_INSN (curr_insn)->frequency);
+                       REG_FREQ_FROM_BB (BLOCK_FOR_INSN (curr_insn)));
               dump_insn_slim (lra_dump_file, curr_insn);
             }
           if (contains_reg_p (x, true, false))
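For context on the lra-*.c hunks: REG_FREQ_FROM_BB (defined in gcc/regs.h) is the frequency measure the rest of the register allocator already uses, and its key property for this PR is that it degenerates to a constant when optimizing for size, so block frequencies stop skewing -Os decisions. The following is only a from-memory paraphrase of the 4.9-era definition; treat the exact form and constants as approximate.

/* Approximate shape of REG_FREQ_FROM_BB from gcc/regs.h (4.9 era):
   when optimizing for size every block gets the same weight, otherwise
   the block frequency is rescaled into the REG_FREQ range.  */
#define REG_FREQ_FROM_BB(bb)                                          \
  (optimize_function_for_size_p (cfun)                                \
   ? REG_FREQ_MAX                                                     \
   : ((bb)->frequency * REG_FREQ_MAX / BB_FREQ_MAX)                   \
     ? ((bb)->frequency * REG_FREQ_MAX / BB_FREQ_MAX)                 \
     : 1)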
(In reply to Vladimir Makarov from comment #9)
> I think major problem is in wrong alternative choices as thumb lo/hi reg
> usage is complicated.

That is probably going to be the reason.

> I take this bug very seriously. If I cannot fix it till end of Jan (sorry,
> it is a vacation time), probably we should switch to reload pass for thumb.

Thanks very much for taking this so seriously.

> To be honest, I don't know why 12 is not fixed. It results in using
> 12 by IRA and bigger code when even reload is used. I believe it
> should fixed too.

According to the ABI, r12 is a caller-saved register. Am I correct in understanding that we are taking out a caller-saved register so that LRA can choose other low regs and thereby prevent movements to and from r12? Taking out r12 will have performance implications, so any change here needs to be benchmarked very carefully on all cores, with Thumb1 and Thumb2, to make sure the impact is acceptable. Additionally we shouldn't need to do this in ARM state, so this patchlet as it stands is not enough.

regards
Ramana

(In reply to Ramana Radhakrishnan from comment #11)

Scratch that: it's already done for t16 and -Os. Changing LAST_HI_REGNUM to 14 in arm.h should be enough.

regards
Ramana

The original reason we took most high registers out of the available registers list for -Os is that saving them (they're mostly callee-saved) is quite expensive -- they have to be copied to low registers first. IP was the exception, since it's call-clobbered and there is thus no code cost to making it available.

While I'm happy to take IP out of the available registers list for now as a short-term expedient work-around, I see this as being indicative of a more fundamental problem. When not optimizing for size, the register allocator should be able to use the high registers as alternatives to spill slots (copy the value to a high register rather than spill it to the stack). Once we have that, it would make sense for the few cases where high registers are permitted to legitimately use the copy that is up there rather than moving it back to a low register first, so I wouldn't want to change the insn patterns to remove high registers entirely from operations like add, sub and compare.

On the other hand, LRA does need to be aware that once a value is in a high register, really spilling it to memory, or loading it from memory, is an expensive operation, since such transfers have to go via a low register.

(In reply to Richard Earnshaw from comment #13)
> When not optimizing for size, the register allocator should be able to use
> the high registers as alternatives to spill slots (copy the value to a high
> register rather than spill it to the stack). Once we have that, it would
> make sense for the few cases where high registers are permitted to
> legitimately use the copy that is up there rather than moving it back again
> to a low register first, so I wouldn't want to change the insn patterns to
> remove high registers entirely from operations like add, sub and compare.

I'll continue my work on this problem in 2 weeks (after my vacation). I'd like to implement the proposed spilling into high regs and try to tune the choice of insn alternatives to be closer to reload (but it is a very sensitive area, as it affects code generation for other LRA targets).

> On the other hand, LRA does need to be aware that once a value is in a high
> register, really spilling it to memory, or loading it from memory is an
> expensive operation, since such transfers have to go via a low register.

Another testcase where the Thumb1 code is poor is gcc.c-torture/execute/pr28982b.c. With LRA we often get sequences such as:

mov r3, sp
ldr r2, .L8+16
add r3, r3, r2    // r2 dead
str r0, [r3]

Instead of:

ldr r2, .L8+16
add r2, r2, sp
str r0, [r2]

Which is both shorter and needs fewer registers.
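A minimal sketch of the kind of access that produces this shape (not taken from pr28982b.c itself; the array size and index are made up): a store to a stack slot whose offset exceeds the Thumb-1 sp-relative addressing range, so the address has to be formed in a register first.

/* Hypothetical reproducer: big[400] sits 1600 bytes above the slot base,
   beyond the 1020-byte limit of Thumb-1 sp-relative str, so the compiler
   must build sp + constant in a register before the store.  */
void
store_far (int v)
{
  volatile int big[600];
  big[400] = v;
}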
Author: vmakarov
Date: Fri Feb 14 16:18:29 2014
New Revision: 207787

URL: http://gcc.gnu.org/viewcvs?rev=207787&root=gcc&view=rev
Log:
2014-02-14  Vladimir Makarov  <vmakarov@redhat.com>
            Richard Earnshaw  <rearnsha@arm.com>

        PR rtl-optimization/59535
        * lra-constraints.c (process_alt_operands): Encourage alternative
        when unassigned pseudo class is superset of the alternative class.
        (inherit_reload_reg): Don't inherit when optimizing for code size.
        * config/arm/arm.h (MODE_BASE_REG_CLASS): Add version for LRA
        returning CORE_REGS for anything but Thumb1 and BASE_REGS for modes
        not less than 4 for Thumb1.

Modified:
        trunk/gcc/ChangeLog
        trunk/gcc/config/arm/arm.h
        trunk/gcc/lra-constraints.c

Should be fixed by Vlad's trunk commit.

I compared GCC 4.8.3 and GCC 4.9.0 for arm-none-eabi, and I still see a code size increase for Thumb1 (and Thumb2) on both my arm966e and my cortex-m4 targets.

GCC 4.8.3
  RAM used    93812
  Flash used  515968

GCC 4.9.0
  RAM used    93812 (same)
  Flash used  522608 (+1.3%)

Then I tried disabling LRA and the results got better:

GCC 4.9.0 with added flag "-mno-lra"
  RAM used    93812 (same)
  Flash used  519624 (+0.7%)

The flags used are otherwise identical for both tests:

-Os -g3 -ggdb3 -gdwarf-4
-fvar-tracking-assignments -fverbose-asm -fno-common -ffunction-sections
-fdata-sections -fno-exceptions -fno-asynchronous-unwind-tables
-fno-unwind-tables
-mthumb -mcpu=arm966e-s -msoft-float -mno-unaligned-access

Generally GCC 4.9.0 seems to produce larger code. I tried to experiment with LTO (-flto -ffat-lto-objects), but then code size increased even more for both GCC 4.8.3 and GCC 4.9.0; I was expecting a size decrease.

Sorry, I cannot share the exact sources used for compilation here; I can share the toolchain build script on request, or try to set up a small test case. I first just wanted to confirm that this bug really is fixed and resolved, and that this is not a new bug or another known issue.

BR /Fredrik
(In reply to Fredrik Hederstierna from comment #18)

It might be another issue, or it may well be an issue with LRA; not many could tell for certain unless we get a small testcase to look at. What we'd like is a small testcase that shows the problem compared with GCC 4.8.3, to progress this further. Please file a new bug report following the instructions in https://gcc.gnu.org/bugs/#report; in this particular case we'd be interested in all the command line options that were used.

regards
Ramana

Here is a small case showing that LRA introduces one more register copy (tested with trunk and 4.9):

int isascii (int c)
{
  return c >= 0 && c < 128;
}

With options -Os -mthumb -mcpu=cortex-m0, I got:

isascii:
        mov     r3, #0
        mov     r2, #127
        mov     r1, r3    //???
        cmp     r2, r0
        adc     r1, r1, r3
        mov     r0, r1
        bx      lr

With options -Os -mthumb -mcpu=cortex-m0 -mno-lra, I got:

isascii:
        mov     r2, #127
        mov     r3, #0
        cmp     r2, r0
        adc     r3, r3, r3
        mov     r0, r3
        bx      lr

I filed this previously; maybe it's a duplicate of Bug 61578 - Code size increase for ARM thumb compared to 4.8.x when compiling with -Os.

BR Fredrik