[PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

Wed Oct 28 18:10:29 GMT 2020

On Wed, 28 Oct 2020 at 18:44, Richard Earnshaw
<Richard.Earnshaw@foss.arm.com> wrote:
>
> On 27/10/2020 15:42, Richard Earnshaw via Gcc-patches wrote:
> > On 26/10/2020 10:52, Christophe Lyon via Gcc-patches wrote:
> >> On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw
> >> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>
> >>> On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote:
> >>>> On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw
> >>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>
> >>>>> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote:
> >>>>>> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
> >>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>>
> >>>>>>> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
> >>>>>>>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
> >>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On 20/10/2020 12:22, Richard Earnshaw wrote:
> >>>>>>>>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
> >>>>>>>>>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
> >>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
> >>>>>>>>>>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
> >>>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
> >>>>>>>>>>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
> >>>>>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
> >>>>>>>>>>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
> >>>>>>>>>>>>>>>>> from code memory (like we do without -mpure-code).
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> This patch builds the value of mi_delta into r3 with a series of
> >>>>>>>>>>>>>>>>> movs/adds/lsls.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> We also do some cleanup by not emitting the function address and delta
> >>>>>>>>>>>>>>>>> via .word directives at the end of the thunk since we don't use them
> >>>>>>>>>>>>>>>>> with -mpure-code.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> No need for new testcases, this bug was already identified by
> >>>>>>>>>>>>>>>>> eg. pr46287-3.C
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 2020-09-29  Christophe Lyon  <christophe.lyon@linaro.org>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>       gcc/
> >>>>>>>>>>>>>>>>>       * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
> >>>>>>>>>>>>>>>>>       do not emit function address and delta when -mpure-code is used.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Richard,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks for your comments.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> There are some optimizations you can make to this code.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Firstly, for values between 256 and 510 (inclusive), it would be better
> >>>>>>>>>>>>>>>> to just expand a mov of 255 followed by an add.
> >>>>>>>>>>>>>>> I now see the splitted for the "Pe" constraint which I hadn't noticed
> >>>>>>>>>>>>>>> before, so I can write something similar indeed.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> However, I'm note quite sure to understand the benefit in the split
> >>>>>>>>>>>>>>> when -mpure-code is NOT used.
> >>>>>>>>>>>>>>> Consider:
> >>>>>>>>>>>>>>> int f3_1 (void) { return 510; }
> >>>>>>>>>>>>>>> int f3_2 (void) { return 511; }
> >>>>>>>>>>>>>>> Compile with -O2 -mcpu=cortex-m0:
> >>>>>>>>>>>>>>> f3_1:
> >>>>>>>>>>>>>>>         movs    r0, #255
> >>>>>>>>>>>>>>>         lsls    r0, r0, #1
> >>>>>>>>>>>>>>>         bx      lr
> >>>>>>>>>>>>>>> f3_2:
> >>>>>>>>>>>>>>>         ldr     r0, .L4
> >>>>>>>>>>>>>>>         bx      lr
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The splitter makes the code bigger, does it "compensate" for this by
> >>>>>>>>>>>>>>> not having to load the constant?
> >>>>>>>>>>>>>>> Actually the constant uses 4 more bytes, which should be taken into
> >>>>>>>>>>>>>>> account when comparing code size,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Yes, the size of the literal pool entry needs to be taken into account.
> >>>>>>>>>>>>>>  It might happen that the entry could be shared with another use of that
> >>>>>>>>>>>>>> literal, but in general that's rare.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
> >>>>>>>>>>>>>>> thumb1 instructions would be equivalent in size compared to loading
> >>>>>>>>>>>>>>> from the literal pool. Should the 256-510 range be extended?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> It's a bit borderline at three instructions when literal pools are not
> >>>>>>>>>>>>>> expensive to use, but in thumb1 literal pools tend to be quite small due
> >>>>>>>>>>>>>> to the limited pc offsets we can use.  I think on balance we probably
> >>>>>>>>>>>>>> want to use the instruction sequence unless optimizing for size.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> This is also true for
> >>>>>>>>>>>>>>>> the literal pools alternative as well, so should be handled before all
> >>>>>>>>>>>>>>>> this.
> >>>>>>>>>>>>>>> I am not sure what you mean: with -mpure-code, the above sample is compiled as:
> >>>>>>>>>>>>>>> f3_1:
> >>>>>>>>>>>>>>>         movs    r0, #255
> >>>>>>>>>>>>>>>         lsls    r0, r0, #1
> >>>>>>>>>>>>>>>         bx      lr
> >>>>>>>>>>>>>>> f3_2:
> >>>>>>>>>>>>>>>         movs    r0, #1
> >>>>>>>>>>>>>>>         lsls    r0, r0, #8
> >>>>>>>>>>>>>>>         adds    r0, r0, #255
> >>>>>>>>>>>>>>>         bx      lr
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> so the "return 510" case is already handled as without -mpure-code.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I was thinking specifically of the thunk sequence where you seem to be
> >>>>>>>>>>>>>> emitting instructions directly rather than generating RTL.  The examples
> >>>>>>>>>>>>>> you show here are not thunks.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> OK thanks for the clarification.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Here is an updated version, split into 3 patches to hopefully make
> >>>>>>>>>>>>> review easier.
> >>>>>>>>>>>>> They apply on top of my other mpure-code patches for PR96967 and PR96770:
> >>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
> >>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I kept it this way to make incremental changes easier to understand.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Patch 1: With the hope to avoid confusion and make maintenance easier,
> >>>>>>>>>>>>> I have updated thumb1_gen_const_int() so that it can generate either RTL or
> >>>>>>>>>>>>> asm. This way, all the code used to build thumb-1 constants is in the
> >>>>>>>>>>>>> same place,
> >>>>>>>>>>>>>  in case we need to improve/fix it later. We now generate shorter sequences in
> >>>>>>>>>>>>> several cases matching your comments.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and
> >>>>>>>>>>>>> calls thumb1_gen_const_int.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Patch 3: Update of the original patch in this thread, now calls
> >>>>>>>>>>>>> thumb1_gen_const_int.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Yuk!  Those changes to thumb1_gen_const_int are horrible.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I think we should be able to leverage the fact that the compiler can use
> >>>>>>>>>>>> C++ now to do much better than that, for example by making that function
> >>>>>>>>>>>> a template.  For example (and this is just a sketch):
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Indeed! I didn't think about it since there is no other use of
> >>>>>>>>>>> templates in arm.c yet.
> >>>>>>>>>>> I'll send an update soon.
> >>>>>>>>>>>
> >>>>>>>>>>> Other than that, does the approach look OK to you?
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Yes, I think this is heading in the right direction.  Bringing the two
> >>>>>>>>>> immediate generating operations into a single function can only be a
> >>>>>>>>>> good thing.
> >>>>>>>>>>
> >>>>>>>>>> Looking again at your example constant sequences, I see:
> >>>>>>>>>>
> >>>>>>>>>> 0x1000010:
> >>>>>>>>>>         movs    r3, #16
> >>>>>>>>>>         lsls    r3, #16
> >>>>>>>>>>         adds    r3, #1
> >>>>>>>>>>         lsls    r3, #4
> >>>>>>>>>> 0x1000011:
> >>>>>>>>>>         movs    r3, #1
> >>>>>>>>>>         lsls    r3, #24
> >>>>>>>>>>         adds    r3, #17
> >>>>>>>>>>
> >>>>>>>>>> The first of these looks odd, given the second sequence.  Why doesn't
> >>>>>>>>>> the first expand to
> >>>>>>>>>>
> >>>>>>>>>> 0x1000010:
> >>>>>>>>>>         movs    r3, #16
> >>>>>>>>>>         lsls    r3, #16
> >>>>>>>>>>         adds    r3, #16
> >>>>>>>>>>
> >>>>>>>>> Err, I mean to:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> 0x1000010:
> >>>>>>>>>         movs    r3, #1
> >>>>>>>>>         lsls    r3, #24
> >>>>>>>>>         adds    r3, #16
> >>>>>>>>>
> >>>>>>>>> ?
> >>>>>>>>
> >>>>>>>> Because I first try to right-shift the constant, hoping to reduce its
> >>>>>>>> range and need less instructions to build the higher part, then
> >>>>>>>> left-shift back.
> >>>>>>>>
> >>>>>>>> In this particular case, we'd need to realize that there are many
> >>>>>>>> zeros "inside" the constant.
> >>>>>>>>
> >>>>>>>> If I remove the part that tries to reduce the range, I do get that
> >>>>>>>> sequence, but for 764 I now generate
> >>>>>>>> movs    r3, #2
> >>>>>>>> lsls    r3, #8
> >>>>>>>> adds    r3, #252
> >>>>>>>> instead of
> >>>>>>>> movs    r3, #191
> >>>>>>>> lsls    r3, #2
> >>>>>>>>
> >>>>>>>> A possibility would be to try both approaches and keep the shortest one.
> >>>>>>>
> >>>>>>> Lets leave that for now, it's not important to fixing the main issue;
> >>>>>>> but we should remember we need to come back to it at some point.
> >>>>>>>
> >>>>>> Thanks, that's what I was thinking too.
> >>>>>>
> >>>>>>> There are other tricks as well, such as
> >>>>>>>
> >>>>>>>   0xffffff
> >>>>>>>
> >>>>>>> can be done as
> >>>>>>>
> >>>>>>>   0x1000000 - 1
> >>>>>>>
> >>>>>>> and
> >>>>>>>
> >>>>>>>   0xfffffd
> >>>>>>>
> >>>>>>> as
> >>>>>>>
> >>>>>>>   0x1000000 - 3
> >>>>>>>
> >>>>>>> but these can wait as well.
> >>>>>>>
> >>>>>>
> >>>>>> Didn't we already need to handle such tricks? I'm surprised this
> >>>>>> wasn't needed earlier.
> >>>>>>
> >>>>>
> >>>>> I don't think we ever worried about them.  Most of them need at least 3
> >>>>> instructions so aren't a code size saving over using a literal pool entry.
> >>>>>
> >>>> OK, this will also help when using -mslow-flash-data.
> >>>>
> >>>> Here are updated patches, now using a template as you suggested.
> >>>
> >>> Looking better, but when I try to apply this to my local tree patch 2
> >>> fails (I'm not exactly sure why, what was your baseline for these
> >>> patches?)
> >> I have the tree patches in this thread on top of these other two:
> >> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556768.html
> >> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556769.html
> >>
> >> They have gradual improvements to thumb1_movsi_insn.
> >>
> >>> -- that patch looks suspicious anyway, you're replacing code
> >>> that prints out assembly with code that generates RTL.
> >> Right! I took me a while to understand how I could miss this, sorry.
> >> That was caused by improper testing, as this part of the code
> >> isn't used when targetting cortex-m0. I have added a testcase
> >> for cortex-m23 which crashes with the previous version of patch 2,
> >> and succeeds now.
> >>
> >>> Could you also rename t1_print and t1_rtl to thumb1_const_print and
> >>> thumb1_const_rtl.  I think the names as they stand are likely to be too
> >>> generic.
> >> OK, done.
> >>
> >> How about this new version?
> >> I'm not yet sure about the most appropriate naming for:
> >> thumb1_gen_const_int
> >> thumb1_gen_const_int_asm
> >> should they be
> >> thumb1_gen_const_int_rtl
> >> thumb1_gen_const_int_print
> >> to be consistent with the new classes?
> >
> > It would probably be better, yes.
> >
> > More detailed comments below.
> >
> > R.
> >
> >>
> >> Thanks,
> >>
> >> Christophe
> >>
> >>> R.
> >>>
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Christophe
> >>>>
> >>>>> R.
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> R.
> >>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> R.
> >>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>> Christophe
> >>>>>>>>>>>
> >>>>>>>>>>>> class t1_rtl
> >>>>>>>>>>>> {
> >>>>>>>>>>>>  public:
> >>>>>>>>>>>>   void ashift(int a) { gen_rtx_ASHIFT(a); }
> >>>>>>>>>>>>   void rshift(int b) { gen_rtx_SHIFTRT(b); }
> >>>>>>>>>>>> };
> >>>>>>>>>>>>
> >>>>>>>>>>>> class t1_print
> >>>>>>>>>>>> {
> >>>>>>>>>>>>  public:
> >>>>>>>>>>>>   t1_print (FILE *f) : t_file(f) {}
> >>>>>>>>>>>>   void ashift (int a) { fprintf (t_file, "a shift %d\n", a); }
> >>>>>>>>>>>>   void rshift (int b) { fprintf (t_file, "r shift %d\n", b); }
> >>>>>>>>>>>>  private:
> >>>>>>>>>>>>   FILE *t_file;
> >>>>>>>>>>>> };
> >>>>>>>>>>>>
> >>>>>>>>>>>> template <class T>
> >>>>>>>>>>>> void thumb1_gen_const_int(T t, int f)
> >>>>>>>>>>>> {
> >>>>>>>>>>>>   // Expansion of thumb1_gen_const_int ...
> >>>>>>>>>>>>   t.ashift(f);
> >>>>>>>>>>>> }
> >>>>>>>>>>>>
> >>>>>>>>>>>> // Usage...
> >>>>>>>>>>>> void f1()
> >>>>>>>>>>>> {
> >>>>>>>>>>>>   // Use the RTL expander
> >>>>>>>>>>>>   t1_rtl g;
> >>>>>>>>>>>>   thumb1_gen_const_int (g, 3);
> >>>>>>>>>>>> }
> >>>>>>>>>>>>
> >>>>>>>>>>>> void f2()
> >>>>>>>>>>>> {
> >>>>>>>>>>>>   // Use the printf expander writing to stdout
> >>>>>>>>>>>>   t1_print g(stdout);
> >>>>>>>>>>>>   thumb1_gen_const_int (g, 3);
> >>>>>>>>>>>> }
> >>>>>>>>>>>>
> >>>>>>>>>>>> With this you can write thumb1_gen_const_int without having to worry
> >>>>>>>>>>>> about which expander is being used in each instance and the template
> >>>>>>>>>>>> expansion will use the right version.
> >>>>>>>>>>>>
> >>>>>>>>>>>> R.
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>  I also suspect (but haven't check) that the base adjustment will
> >>>>>>>>>>>>>>>> most commonly be a multiple of the machine word size (ie 4).  If that is
> >>>>>>>>>>>>>>>> the case then you could generate n/4 and then shift it left by 2 for an
> >>>>>>>>>>>>>>>> even greater range of literals.
> >>>>>>>>>>>>>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY case,
> >>>>>>>>>>>>>>> I'll update my patch.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>  More generally, any sequence of up to
> >>>>>>>>>>>>>>>> three thumb1 instructions will be no larger, and probably as fast as the
> >>>>>>>>>>>>>>>> existing literal pool fall back.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), your code will
> >>>>>>>>>>>>>>>> emit a mov followed by two shift-by-8 instructions; the two shifts could
> >>>>>>>>>>>>>>>> be merged into a single shift-by-16.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Right, I'll try to make use of thumb_shiftable_const.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Finally, I'd really like to see some executable tests for this, if at
> >>>>>>>>>>>>>>>> all possible.
> >>>>>>>>>>>>>>> I mentioned pr46287-3.C, but that's not the only existing testcase
> >>>>>>>>>>>>>>> that showed the problem. There are also:
> >>>>>>>>>>>>>>> g++.dg/opt/thunk1.C
> >>>>>>>>>>>>>>> g++.dg/ipa/pr46984.C
> >>>>>>>>>>>>>>> g++.dg/torture/pr46287.C
> >>>>>>>>>>>>>>> g++.dg/torture/pr45699.C
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Do you want that I copy one of these in the arm subdir and add
> >>>>>>>>>>>>>>> -mpure-code in dg-options?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On reflection, probably not - that just makes things more complicated
> >>>>>>>>>>>>>> with all the dg-options mess (I'm worried about interactions with other
> >>>>>>>>>>>>>> sets of options on the command line and the fall-out from that).  If
> >>>>>>>>>>>>>> someone cares about pure-code they should be doing full testsuite runs
> >>>>>>>>>>>>>> with it enabled and that should be sufficient.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Yes, that's what I am doing manually, it's a bit tricky, and I use a
> >>>>>>>>>>>>> modified simulator.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Christophe
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> R.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Christophe
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> R.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> k#   (use "git pull" to merge the remote branch into yours)
> >>>>>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>>>>>  gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++---------------
> >>>>>>>>>>>>>>>>>  1 file changed, 66 insertions(+), 25 deletions(-)
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
> >>>>>>>>>>>>>>>>> index ceeb91f..62abeb5 100644
> >>>>>>>>>>>>>>>>> --- a/gcc/config/arm/arm.c
> >>>>>>>>>>>>>>>>> +++ b/gcc/config/arm/arm.c
> >>>>>>>>>>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
> >>>>>>>>>>>>>>>>>      {
> >>>>>>>>>>>>>>>>>        if (mi_delta > 255)
> >>>>>>>>>>>>>>>>>       {
> >>>>>>>>>>>>>>>>> -       fputs ("\tldr\tr3, ", file);
> >>>>>>>>>>>>>>>>> -       assemble_name (file, label);
> >>>>>>>>>>>>>>>>> -       fputs ("+4\n", file);
> >>>>>>>>>>>>>>>>> +       /* With -mpure-code, we cannot load delta from the constant
> >>>>>>>>>>>>>>>>> +          pool: we build it explicitly.  */
> >>>>>>>>>>>>>>>>> +       if (target_pure_code)
> >>>>>>>>>>>>>>>>> +         {
> >>>>>>>>>>>>>>>>> +           bool mov_done_p = false;
> >>>>>>>>>>>>>>>>> +           int i;
> >>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>> +           /* Emit upper 3 bytes if needed.  */
> >>>>>>>>>>>>>>>>> +           for (i = 0; i < 3; i++)
> >>>>>>>>>>>>>>>>> +             {
> >>>>>>>>>>>>>>>>> +               int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
> >>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>> +               if (byte)
> >>>>>>>>>>>>>>>>> +                 {
> >>>>>>>>>>>>>>>>> +                   if (mov_done_p)
> >>>>>>>>>>>>>>>>> +                     asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
> >>>>>>>>>>>>>>>>> +                   else
> >>>>>>>>>>>>>>>>> +                     asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
> >>>>>>>>>>>>>>>>> +                   mov_done_p = true;
> >>>>>>>>>>>>>>>>> +                 }
> >>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>> +               if (mov_done_p)
> >>>>>>>>>>>>>>>>> +                 asm_fprintf (file, "\tlsls\tr3, #8\n");
> >>>>>>>>>>>>>>>>> +             }
> >>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>> +           /* Emit lower byte if needed.  */
> >>>>>>>>>>>>>>>>> +           if (!mov_done_p)
> >>>>>>>>>>>>>>>>> +             asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
> >>>>>>>>>>>>>>>>> +           else if (mi_delta & 0xff)
> >>>>>>>>>>>>>>>>> +             asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
> >>>>>>>>>>>>>>>>> +         }
> >>>>>>>>>>>>>>>>> +       else
> >>>>>>>>>>>>>>>>> +         {
> >>>>>>>>>>>>>>>>> +           fputs ("\tldr\tr3, ", file);
> >>>>>>>>>>>>>>>>> +           assemble_name (file, label);
> >>>>>>>>>>>>>>>>> +           fputs ("+4\n", file);
> >>>>>>>>>>>>>>>>> +         }
> >>>>>>>>>>>>>>>>>         asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
> >>>>>>>>>>>>>>>>>                      mi_op, this_regno, this_regno);
> >>>>>>>>>>>>>>>>>       }
> >>>>>>>>>>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
> >>>>>>>>>>>>>>>>>       fputs ("\tpop\t{r3}\n", file);
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>        fprintf (file, "\tbx\tr12\n");
> >>>>>>>>>>>>>>>>> -      ASM_OUTPUT_ALIGN (file, 2);
> >>>>>>>>>>>>>>>>> -      assemble_name (file, label);
> >>>>>>>>>>>>>>>>> -      fputs (":\n", file);
> >>>>>>>>>>>>>>>>> -      if (flag_pic)
> >>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>> +      /* With -mpure-code, we don't need to emit literals for the
> >>>>>>>>>>>>>>>>> +      function address and delta since we emitted code to build
> >>>>>>>>>>>>>>>>> +      them.  */
> >>>>>>>>>>>>>>>>> +      if (!target_pure_code)
> >>>>>>>>>>>>>>>>>       {
> >>>>>>>>>>>>>>>>> -       /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> >>>>>>>>>>>>>>>>> -       rtx tem = XEXP (DECL_RTL (function), 0);
> >>>>>>>>>>>>>>>>> -       /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> >>>>>>>>>>>>>>>>> -          pipeline offset is four rather than eight.  Adjust the offset
> >>>>>>>>>>>>>>>>> -          accordingly.  */
> >>>>>>>>>>>>>>>>> -       tem = plus_constant (GET_MODE (tem), tem,
> >>>>>>>>>>>>>>>>> -                            TARGET_THUMB1_ONLY ? -3 : -7);
> >>>>>>>>>>>>>>>>> -       tem = gen_rtx_MINUS (GET_MODE (tem),
> >>>>>>>>>>>>>>>>> -                            tem,
> >>>>>>>>>>>>>>>>> -                            gen_rtx_SYMBOL_REF (Pmode,
> >>>>>>>>>>>>>>>>> -                                                ggc_strdup (labelpc)));
> >>>>>>>>>>>>>>>>> -       assemble_integer (tem, 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>>>>>>> -     }
> >>>>>>>>>>>>>>>>> -      else
> >>>>>>>>>>>>>>>>> -     /* Output ".word .LTHUNKn".  */
> >>>>>>>>>>>>>>>>> -     assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>>>>>>> +       ASM_OUTPUT_ALIGN (file, 2);
> >>>>>>>>>>>>>>>>> +       assemble_name (file, label);
> >>>>>>>>>>>>>>>>> +       fputs (":\n", file);
> >>>>>>>>>>>>>>>>> +       if (flag_pic)
> >>>>>>>>>>>>>>>>> +         {
> >>>>>>>>>>>>>>>>> +           /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> >>>>>>>>>>>>>>>>> +           rtx tem = XEXP (DECL_RTL (function), 0);
> >>>>>>>>>>>>>>>>> +           /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> >>>>>>>>>>>>>>>>> +              pipeline offset is four rather than eight.  Adjust the offset
> >>>>>>>>>>>>>>>>> +              accordingly.  */
> >>>>>>>>>>>>>>>>> +           tem = plus_constant (GET_MODE (tem), tem,
> >>>>>>>>>>>>>>>>> +                                TARGET_THUMB1_ONLY ? -3 : -7);
> >>>>>>>>>>>>>>>>> +           tem = gen_rtx_MINUS (GET_MODE (tem),
> >>>>>>>>>>>>>>>>> +                                tem,
> >>>>>>>>>>>>>>>>> +                                gen_rtx_SYMBOL_REF (Pmode,
> >>>>>>>>>>>>>>>>> +                                                    ggc_strdup (labelpc)));
> >>>>>>>>>>>>>>>>> +           assemble_integer (tem, 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>>>>>>> +         }
> >>>>>>>>>>>>>>>>> +       else
> >>>>>>>>>>>>>>>>> +         /* Output ".word .LTHUNKn".  */
> >>>>>>>>>>>>>>>>> +         assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> -      if (TARGET_THUMB1_ONLY && mi_delta > 255)
> >>>>>>>>>>>>>>>>> -     assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>>>>>>> +       if (TARGET_THUMB1_ONLY && mi_delta > 255)
> >>>>>>>>>>>>>>>>> +         assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>>>>>>> +     }
> >>>>>>>>>>>>>>>>>      }
> >>>>>>>>>>>>>>>>>    else
> >>>>>>>>>>>>>>>>>      {
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>
> >>>
> >
> > +class thumb1_const_rtl
> > ...
> > +  void mov (int val)
> >
> > This should take a HOST_WIDE_INT.  Similarly for add and shift.  The
> > same applies to the asm version as well.
> >
> > +    asm_fprintf (t_file, "\tmovs\tr%d, #%d\n", dst_regno, val);
> >
> > Should be using reg_names[dst_regno] in all cases.  In fact, you might
> > want to move that lookup to the constructor and just save a pointer to
> > the string there.  You'll need to use HOST_WIDE_INT_PRINT_UNSIGNED
>
> Correction, for a (signed) HOST_WIDE_INT, this should be
> HOST_WIDE_INT_PRINT_DEC.
>

Right, but if "val" is unsigned HOST_WIDE_INT, all the methods in the
two classes
can have unsigned HOST_WIDE_INT parameters, and thus use
HOST_WIDE_INT_PRINT_UNSIGNED?

> > rather than "%d" for the immediate.
> >
> > +template <class T>
> > +void
> > +thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1)
> > +{
> > +  bool mov_done_p = false;
> > +  int val = op1;
> >
> > This potentially silently loses precision.  In fact, I think you really
> > want to use "unsigned HOST_WIDE_INT" throughout the following code, so
> > that the right shifts aren't undefined if dealing with negative numbers.
> >
> > For safety, you should also have an assertion in here that
> >
> >   op1 == trunc_int_for_mode (op1, SImode)
> >
> > +  int shift = 0;
> > +  int i;
> > +
> > +  if (val == 0)
> >
> > You can short-circuit 0..255 here for a quick exit.
> >
> > +    {
> > +      dst.mov (val);
> > +      return;
> > +    }
> >
> > Another trick: if the top nine bits of the 32-bit value are all set,
> > you're probably going to be better off (and certainly not worse off) by
> > generating -op1 and then negating the result in a final step - you can
> > do that via recursion.
> >
> > +
> > +  /* In the general case, we need 7 instructions to build
> > +     a 32 bits constant (1 movs, 3 lsls, 3 adds). We can
> > +     do better if VAL is small enough, or
> > +     right-shiftable by a suitable amount.  If the
> > +     right-shift enables to encode at least one less byte,
> > +     it's worth it: we save a adds and a lsls at the
> > +     expense of a final lsls.  */
> > +  int final_shift = number_of_first_bit_set (val);
> > +
> > +  int leading_zeroes = clz_hwi (val);
> > +  int number_of_bytes_needed
> > +    = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes)
> > +       / BITS_PER_UNIT) + 1;
> > +  int number_of_bytes_needed2
> > +    = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes - final_shift)
> > +       / BITS_PER_UNIT) + 1;
> > +
> > +  if (number_of_bytes_needed2 < number_of_bytes_needed)
> > +    val >>= final_shift;
> > +  else
> > +    final_shift = 0;
> > +
> > +  /* If we are in a very small range, we can use either a single movs
> > +     or movs+adds.  */
> > +  if ((val >= 0) && (val <= 510))
> >
> > if val is made unsigned HWI as I suggest, the lower bounds test is not
> > needed.
> >
> > +    {
> > +      if (val > 255)
> > +     {
> > +       int high = val - 255;
> >
> > Again, watch your types.
> >
> > +
> > +       dst.mov (high);
> > +       dst.add (255);
> > +     }
> > +      else
> > +     dst.mov (val);
> > +
> > +      if (final_shift > 0)
> > +     dst.ashift (final_shift);
> > +    }
> > +  else
> > +    {
> > +      /* General case, emit upper 3 bytes as needed.  */
> > +      for (i = 0; i < 3; i++)
> > +     {
> > +       int byte = (val >> (8 * (3 - i))) & 0xff;
> >
> > and here.
> >
> > +
> > +       if (byte)
> > +         {
> > +           /* We are about to emit new bits, stop accumulating a
> > +              shift amount, and left-shift only if we have already
> > +              emitted some upper bits.  */
> > +           if (mov_done_p)
> > +             {
> > +               dst.ashift (shift);
> > +               dst.add (byte);
> > +             }
> > +           else
> > +             dst.mov (byte);
> > +
> > +           /* Stop accumulating shift amount since we've just
> > +              emitted some bits.  */
> > +           shift = 0;
> > +
> > +           mov_done_p = true;
> > +         }
> > +
> > +       if (mov_done_p)
> > +         shift += 8;
> > +     }
> > +
> > +      /* Emit lower byte.  */
> > +      if (!mov_done_p)
> > +     dst.mov (val & 0xff);
> > +      else
> > +     {
> > +       dst.ashift (shift);
> > +       if (val & 0xff)
> > +         dst.add (val & 0xff);
> > +     }
> > +
> > +      if (final_shift > 0)
> > +     dst.ashift (final_shift);
> > +    }
> > +}
> > +
> >
>