Bug 59535 - [4.9 regression] -Os code size regressions for Thumb1/Thumb2 with LRA
Summary: [4.9 regression] -Os code size regressions for Thumb1/Thumb2 with LRA
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 4.9.0
Importance: P1 normal
Target Milestone: 4.9.0
Assignee: Vladimir Makarov
URL:
Keywords: missed-optimization, ra
Depends on:
Blocks:
 
Reported: 2013-12-17 09:30 UTC by Richard Earnshaw
Modified: 2014-09-03 07:45 UTC
CC List: 6 users

See Also:
Host:
Target: arm
Build:
Known to work:
Known to fail:
Last reconfirmed: 2013-12-19 00:00:00


Attachments
testcase (13.70 KB, text/plain)
2013-12-17 11:14 UTC, Richard Earnshaw
Another testcase (73.55 KB, text/plain)
2013-12-17 13:08 UTC, Richard Earnshaw

Description Richard Earnshaw 2013-12-17 09:30:40 UTC
Running the CSiBE benchmark for -Os with Thumb1 and Thumb2 code generation shows significant regressions since 2013-12-11 (the day LRA was turned on by default for ARM).  These cause code size to grow back to the size it was in 2011.

I'll try to find some simple examples to use as test cases.
Comment 1 Richard Earnshaw 2013-12-17 09:31:59 UTC
CSiBE code size results for Thumb2
2013/12/09 2543786
2013/12/11 2563522
Comment 2 Richard Earnshaw 2013-12-17 09:53:37 UTC
CSiBE code size results for Thumb1

2013/12/09 2634640
2013/12/11 2683980

That is roughly a 1.9% size regression.
Comment 3 Richard Earnshaw 2013-12-17 11:12:44 UTC
It seems that one of the major problems is a significant increase in the number of mov instructions copying registers from one location to another.  This is probably due to the two-operand instruction format we have in Thumb.
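
For illustration, a hand-written sketch (not taken from the attached testcase): most Thumb-1 ALU operations, ORR for example, are two-operand, i.e. the destination is also a source, so computing c = a | b while keeping a live needs an extra register-register move:

	movs	r2, r0          @ copy a into the destination first
	orrs	r2, r1          @ r2 = a | b (ORR overwrites its destination)

In ARM state the same computation is a single three-operand instruction (orr r2, r0, r1), so no copy is needed.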
Comment 4 Richard Earnshaw 2013-12-17 11:14:50 UTC
Created attachment 31455 [details]
testcase

Compile with -Os -mthumb -mcpu=arm7tdmi -fno-short-enums and either -mlra or -mno-lra
Comment 5 Richard Earnshaw 2013-12-17 11:22:02 UTC
Number of register-register move operations in the testcase
lra:    208
no-lra: 105
Comment 6 Richard Earnshaw 2013-12-17 13:08:10 UTC
Created attachment 31457 [details]
Another testcase

Another testcase, but this one has some obvious examples of poor behaviour for -Os.

In addition to the options used for the previous case, this might need
-fno-strict-aliasing -fno-common -fomit-frame-pointer -fno-strength-reduce 

Example one, spilling a value and then keeping a copy in a hard reg over a call.

	mov	r5, r1          <= R1 copied to R5
	sub	sp, sp, #28
	str	r1, [sp, #8]    <= And spilled to the stack
	mov	r2, #12
	mov	r1, #0
	mov	r4, r0
	bl	memset
	mov	r3, #2
	mov	r2, r5          <= Could reload from the stack instead

Example two: use of multiple reloads to use a high register:

	ldr	r3, [sp, #4]
	mov	ip, r3          <= Copying value into high register
	add	ip, ip, r5      <= Arithmetic
	mov	r3, ip          <= Copying result back to original register
	str	r3, [sp, #4]
	ldr	r3, [sp, #12]
	mov	ip, r3          <= And IP is dead anyway...

In this case, 
	mov	ip, r3
	add	ip, ip, r5
	mov	r3, ip

can be replaced entirely with
	add	r3, r5
saving two completely unnecessary MOV instructions.

A third, related case:

	mov	r1, #12
	mov	ip, r1
	add	ip, ip, r4
	mov	r1, ip

Could be done either as
	mov	r1, #12
	add     r1, r4
	mov	ip, r1
or
	mov	r1, r4
	add	r1, #12
	mov	ip, r1

both saving one instruction, or even two if the value doesn't really need copying to a high reg.
Comment 7 Yvan Roux 2013-12-17 22:47:53 UTC
I'm not able to reproduce this with today's trunk or rev #205887 built for arm-none-linux-gnueabi, as the compiler ICEs in assign_by_spills (lra-assigns.c).  What is your configuration?
Comment 8 Yvan Roux 2013-12-18 09:29:11 UTC
Forget my previous comment; I can reproduce it after a rebase.
Comment 9 Vladimir Makarov 2013-12-18 16:06:26 UTC
(In reply to Richard Earnshaw from comment #5)
> Number of register-register move operations in the testcase
> lra:    208
> no-lra: 105

That is a huge degradation.  And I guess it is not only because of 2-operand insns: LRA works adequately on x86/x86-64.  I do know of some issues with 2-operand commutative insns, though (I am trying to fix those too).

I think the major problem is wrong alternative choices, as Thumb lo/hi reg usage is complicated.

I take this bug very seriously.  If I cannot fix it by the end of January (sorry, it is vacation time), we should probably switch to the reload pass for Thumb.
Comment 10 Vladimir Makarov 2013-12-18 22:14:02 UTC
./cc1 -Os -mthumb -mcpu=arm7tdmi -fno-short-enums test.i -m{no-}lra -fno-schedule-insns2

                       original reload  reload with change  lra with change
reg moves                    104                 86                101
all moves                    210                192                207
loads                        561                569                541
stores                       234                231                210

overall (mov/ldr/str)       1005                992                958

  First of all, I could not compile the test with LRA because it
crashes.  The problem is in choosing alternatives differently from
reload: for some insn alternatives, LRA needs two hi regs for reloads,
but only one (r12) is available, as all other hi regs are fixed.

 To be honest, I don't know why r12 is not fixed.  It results in IRA
using r12 and in bigger code even when reload is used.  I believe it
should be fixed too.

 Second, a big number of moves is not by itself a bad thing.  It should
be considered together with the number of loads/stores.  LRA generates 15
more reg moves but 28 fewer loads and 21 fewer stores, which means that
LRA with the patch generates smaller and faster code than reload.

Here is a patch that improves code size for reload, fixes the LRA crash,
and improves the code generated with LRA.

  If the arm part of the patch is OK, please let me know and I'll commit
it.  The rest of the patch does not affect code generation for this test,
but it might affect other code generated with -Os.

Index: config/arm/arm.c
===================================================================
--- config/arm/arm.c    (revision 206089)
+++ config/arm/arm.c    (working copy)
@@ -29236,6 +29236,7 @@ arm_conditional_register_usage (void)
       for (regno = FIRST_HI_REGNUM;
           regno <= LAST_HI_REGNUM; ++regno)
        fixed_regs[regno] = call_used_regs[regno] = 1;
+      fixed_regs[12] = call_used_regs[12] = 1;
     }

   /* The link register can be clobbered by any branch insn,
Index: lra-assigns.c
===================================================================
--- lra-assigns.c       (revision 206089)
+++ lra-assigns.c       (working copy)
@@ -612,7 +612,9 @@ find_hard_regno_for (int regno, int *cos
                && ! df_regs_ever_live_p (hard_regno + j))
              /* It needs save restore.  */
              hard_regno_costs[hard_regno]
-               += 2 * ENTRY_BLOCK_PTR_FOR_FN (cfun)->next_bb->frequency + 1;
+               += (2
+                   * REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun)->next_bb)
+                   + 1);
          priority = targetm.register_priority (hard_regno);
          if (best_hard_regno < 0 || hard_regno_costs[hard_regno] < best_cost
              || (hard_regno_costs[hard_regno] == best_cost
Index: lra-coalesce.c
===================================================================
--- lra-coalesce.c      (revision 206089)
+++ lra-coalesce.c      (working copy)
@@ -79,8 +79,8 @@ move_freq_compare_func (const void *v1p,
   rtx mv2 = *(const rtx *) v2p;
   int pri1, pri2;

-  pri1 = BLOCK_FOR_INSN (mv1)->frequency;
-  pri2 = BLOCK_FOR_INSN (mv2)->frequency;
+  pri1 = REG_FREQ_FROM_BB (BLOCK_FOR_INSN (mv1));
+  pri2 = REG_FREQ_FROM_BB (BLOCK_FOR_INSN (mv2));
   if (pri2 - pri1)
     return pri2 - pri1;

@@ -277,7 +277,7 @@ lra_coalesce (void)
            fprintf
              (lra_dump_file, "      Coalescing move %i:r%d-r%d (freq=%d)\n",
               INSN_UID (mv), sregno, dregno,
-              BLOCK_FOR_INSN (mv)->frequency);
+              REG_FREQ_FROM_BB (BLOCK_FOR_INSN (mv)));
          /* We updated involved_insns_bitmap when doing the merge.  */
        }
       else if (!(lra_intersected_live_ranges_p
@@ -291,7 +291,7 @@ lra_coalesce (void)
               "  Coalescing move %i:r%d(%d)-r%d(%d) (freq=%d)\n",
               INSN_UID (mv), sregno, ORIGINAL_REGNO (SET_SRC (set)),
               dregno, ORIGINAL_REGNO (SET_DEST (set)),
-              BLOCK_FOR_INSN (mv)->frequency);
+              REG_FREQ_FROM_BB (BLOCK_FOR_INSN (mv)));
          bitmap_ior_into (&involved_insns_bitmap,
                           &lra_reg_info[sregno].insn_bitmap);
          bitmap_ior_into (&involved_insns_bitmap,
@@ -316,7 +316,8 @@ lra_coalesce (void)
                /* Coalesced move.  */
                if (lra_dump_file != NULL)
                  fprintf (lra_dump_file, "      Removing move %i (freq=%d)\n",
-                        INSN_UID (insn), BLOCK_FOR_INSN (insn)->frequency);
+                          INSN_UID (insn),
+                          REG_FREQ_FROM_BB (BLOCK_FOR_INSN (insn)));
                lra_set_insn_deleted (insn);
              }
          }
Index: lra-constraints.c
===================================================================
--- lra-constraints.c   (revision 206089)
+++ lra-constraints.c   (working copy)
@@ -4077,7 +4077,7 @@ lra_constraints (bool first_p)
                      fprintf (lra_dump_file,
                               "      Removing equiv init insn %i (freq=%d)\n",
                               INSN_UID (curr_insn),
-                              BLOCK_FOR_INSN (curr_insn)->frequency);
+                              REG_FREQ_FROM_BB (BLOCK_FOR_INSN (curr_insn)));
                      dump_insn_slim (lra_dump_file, curr_insn);
                    }
                  if (contains_reg_p (x, true, false))
Comment 11 Ramana Radhakrishnan 2013-12-19 09:26:00 UTC
(In reply to Vladimir Makarov from comment #9)
> (In reply to Richard Earnshaw from comment #5)

> I think the major problem is wrong alternative choices, as Thumb lo/hi
> reg usage is complicated.

That is probably going to be the reason.

> 
> I take this bug very seriously.  If I cannot fix it by the end of January
> (sorry, it is vacation time), we should probably switch to the reload pass
> for Thumb.

Thanks very much for taking this so seriously.


> 
>  To be honest, I don't know why r12 is not fixed.  It results in IRA
> using r12 and in bigger code even when reload is used.  I believe it
> should be fixed too.


According to the ABI, r12 is a caller-saved register.  Am I correct in understanding that we are taking out a caller-saved register to allow LRA to choose other low regs and thereby prevent moves to and from r12?

Taking out r12 will have performance implications, so any change here needs to be benchmarked very carefully on all cores with Thumb1 and Thumb2 to make sure the impact is acceptable.  Additionally, we shouldn't need to do this in ARM state, so this patchlet as it stands is not enough.


regards
Ramana
Comment 12 Ramana Radhakrishnan 2013-12-19 09:54:21 UTC
(In reply to Ramana Radhakrishnan from comment #11)
> According to the ABI, r12 is a caller-saved register.  Am I correct in
> understanding that we are taking out a caller-saved register to allow LRA
> to choose other low regs and thereby prevent moves to and from r12?

Scratch that: it's already done for t16 and -Os.  Changing LAST_HI_REGNUM to 14 in arm.h should be enough.

regards
Ramana
Comment 13 Richard Earnshaw 2013-12-19 17:12:24 UTC
The original reason we took most high registers out of the available register list for -Os is that saving them (they're mostly callee-saved) is quite expensive -- they have to be copied to low registers first.  IP was the exception, since it's call-clobbered and there is thus no code cost to making it available.

While I'm happy to take IP out of the available register list for now as a short-term expedient workaround, I see this as indicative of a more fundamental problem.

When not optimizing for size, the register allocator should be able to use the high registers as alternatives to spill slots (copy the value to a high register rather than spill it to the stack).  Once we have that, it would make sense for the few cases where high registers are permitted to legitimately use the copy that is up there rather than moving it back again to a low register first, so I wouldn't want to change the insn patterns to remove high registers entirely from operations like add, sub and compare.

On the other hand, LRA does need to be aware that once a value is in a high register, really spilling it to memory, or loading it from memory is an expensive operation, since such transfers have to go via a low register.
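
A hand-written sketch of that cost (not from any of the attached testcases): Thumb-1 has no 16-bit load/store encodings that can access the high registers directly, so really spilling a value that lives in, say, r8 costs an extra low-register move in each direction:

	mov	r3, r8          @ high reg -> low reg first
	str	r3, [sp, #8]    @ only a low reg can be stored this way
	...
	ldr	r3, [sp, #8]    @ reload into a low reg
	mov	r8, r3          @ and move back up to the high reg

That is four instructions per round trip instead of two, plus a scratch low register.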
Comment 14 Vladimir Makarov 2013-12-19 21:42:17 UTC
(In reply to Richard Earnshaw from comment #13)

> When not optimizing for size, the register allocator should be able to use
> the high registers as alternatives to spill slots (copy the value to a high
> register rather than spill it to the stack).  Once we have that, it would
> make sense for the few cases where high registers are permitted to
> legitimately use the copy that is up there rather than moving it back again
> to a low register first, so I wouldn't want to change the insn patterns to
> remove high registers entirely from operations like add, sub and compare.
> 

I'll continue my work on this problem in 2 weeks (after my vacation).  I'd like to implement the proposed spilling into high regs and to try to tune the choice of insn alternatives to be closer to reload's (but that is a very sensitive area, as it affects code generation for other LRA targets).

> On the other hand, LRA does need to be aware that once a value is in a high
> register, really spilling it to memory, or loading it from memory is an
> expensive operation, since such transfers have to go via a low register.
Comment 15 Richard Earnshaw 2014-01-03 14:43:58 UTC
Another testcase where the thumb1 code is poor is gcc.c-torture/execute/pr28982b.c

With LRA we often get sequences such as:

        mov     r3, sp
        ldr     r2, .L8+16
        add     r3, r3, r2         // r2 dead.
        str     r0, [r3]

Instead of:

        ldr     r2, .L8+16
        add     r2, r2, sp
        str     r0, [r2]

The latter is both shorter and needs fewer registers.
Comment 16 Vladimir Makarov 2014-02-14 16:19:03 UTC
Author: vmakarov
Date: Fri Feb 14 16:18:29 2014
New Revision: 207787

URL: http://gcc.gnu.org/viewcvs?rev=207787&root=gcc&view=rev
Log:
2014-02-14  Vladimir Makarov  <vmakarov@redhat.com>
	    Richard Earnshaw  <rearnsha@arm.com>

	PR rtl-optimization/59535
	* lra-constraints.c (process_alt_operands): Encourage alternative
	when unassigned pseudo class is superset of the alternative class.
	(inherit_reload_reg): Don't inherit when optimizing for code size.
	* config/arm/arm.h (MODE_BASE_REG_CLASS): Add version for LRA
	returning CORE_REGS for anything but Thumb1 and BASE_REGS for
	modes not less than 4 for Thumb1.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/arm.h
    trunk/gcc/lra-constraints.c
Comment 17 Jeffrey A. Law 2014-02-14 20:32:40 UTC
Should be fixed by Vlad's trunk commit.
Comment 18 Fredrik Hederstierna 2014-06-12 07:46:53 UTC
I compared GCC 4.8.3 and GCC 4.9.0 for arm-none-eabi, and I still see a code size increase for thumb1 (and thumb2) for both my arm966e and my cortex-m4 targets.

GCC 4.8.3
  RAM used     93812 
  Flash used   515968

GCC 4.9.0
  RAM used     93812 (same)
  Flash used   522608 (+1.3%)

Then I tried to disable LRA and results got better:

GCC 4.9.0 : added flag "-mno-lra"
  RAM used     93812 (same)
  Flash used   519624 (+0.7%)

Flags used are otherwise identical for both tests:

  -Os -g3 -ggdb3 -gdwarf-4
  -fvar-tracking-assignments -fverbose-asm -fno-common -ffunction-sections   -fdata-sections -fno-exceptions -fno-asynchronous-unwind-tables -fno-unwind-tables
  -mthumb -mcpu=arm966e-s -msoft-float -mno-unaligned-access

Generally GCC 4.9.0 seems to produce larger code.  I tried to experiment with LTO (-flto -flto-fat-objects), but then code size increased even more for both GCC 4.8.3 and GCC 4.9.0; I was expecting a code size decrease.

Sorry, I cannot share the exact sources used for compilation here; I can share the toolchain build script on request, or try to set up a small test case.

I first just wanted to confirm that this bug really is fixed and resolved, so it's not a new bug or another known issue.

BR /Fredrik
Comment 19 ramana.radhakrishnan@arm.com 2014-06-13 11:44:34 UTC
On 06/12/14 08:46, fredrik.hederstierna@securitas-direct.com wrote:
> I compared GCC 4.8.3 and GCC 4.9.0 for arm-none-eabi, and I still see a code
> size increase for thumb1 (and thumb2) for both my arm966e and my cortex-m4
> targets.
> [...]
> I first just wanted to confirm that this bug really is fixed and resolved, so
> it's not a new bug or another known issue.

It might be another issue, or it may well be an issue with LRA; nobody
can tell for certain unless we get a small testcase to look at.

What we'd like is a small testcase that shows the problem compared with 
gcc 4.8.3 to progress this further.

Please file a new bug report following the instructions in 
https://gcc.gnu.org/bugs/#report

In this particular case we'd be interested in all the command-line options
that were used.


regards
Ramana


Comment 20 Zhenqiang Chen 2014-09-03 06:17:44 UTC
Here is a small test case showing that LRA introduces one more register copy (tested with trunk and 4.9).

int isascii (int c)
{
  return c >= 0 && c < 128;
}

With options -Os -mthumb -mcpu=cortex-m0, I got:

isascii:
	mov	r3, #0
	mov	r2, #127
	mov	r1, r3          <= extra register copy
	cmp	r2, r0
	adc	r1, r1, r3
	mov	r0, r1
	bx	lr

With options -Os -mthumb -mcpu=cortex-m0 -mno-lra, I got:

isascii:
	mov	r2, #127
	mov	r3, #0
	cmp	r2, r0
	adc	r3, r3, r3
	mov	r0, r3
	bx	lr
Comment 21 Fredrik Hederstierna 2014-09-03 07:45:29 UTC
I filed this previously; maybe it's a duplicate:

Bug 61578 - Code size increase for ARM thumb compared to 4.8.x when compiling with -Os

BR Fredrik