[PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

Richard Earnshaw Richard.Earnshaw@foss.arm.com
Tue Dec 12 17:56:22 GMT 2023



On 30/11/2023 12:55, Stamatis Markianos-Wright wrote:
> Hi Andre,
> 
> Thanks for the comments, see latest revision attached.
> 
> On 27/11/2023 12:47, Andre Vieira (lists) wrote:
>> Hi Stam,
>>
>> Just some comments.
>>
>> +/* Recursively scan through the DF chain backwards within the basic block and
>> +   determine if any of the USEs of the original insn (or the USEs of the insns
>> +   where they were DEF-ed, etc., recursively) were affected by implicit VPT
>>
>> s/Recursively scan/Scan/ as you no longer recurse (thanks for that, by
>> the way :)).  Remove "recursively" too, for the same reason.
>>
>> +      if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P 
>> (cond_temp_iv.step))
>> +    return NULL;
>> +      /* Look at the steps and swap around the rtx's if needed. Error 
>> out if
>> +     one of them cannot be identified as constant.  */
>> +      if (INTVAL (cond_counter_iv.step) != 0 && INTVAL 
>> (cond_temp_iv.step) != 0)
>> +    return NULL;
>>
>> Move this comment above the preceding if statement; that is where the
>> erroring-out it describes actually happens.
> Done
>>
>> +      emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
>> Add a space after the '(enum insn_note)' cast.
>>
>> @@ -173,14 +176,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>    if (! REG_P (reg))
>>      return 0;
>>  -  /* Check if something = (plus (reg) (const_int -1)).
>> +  /* Check if something = (plus (reg) (const_int -n)).
>>       On IA-64, this decrement is wrapped in an if_then_else.  */
>>    inc_src = SET_SRC (inc);
>>    if (GET_CODE (inc_src) == IF_THEN_ELSE)
>>      inc_src = XEXP (inc_src, 1);
>>    if (GET_CODE (inc_src) != PLUS
>>        || XEXP (inc_src, 0) != reg
>> -      || XEXP (inc_src, 1) != constm1_rtx)
>> +      || !CONST_INT_P (XEXP (inc_src, 1)))
>>
>> Do we ever check that inc_src is negative? We used to check that it was
>> -1; now we only check that it's a constant, not a negative one, so I
>> suspect this needs a:
>> || INTVAL (XEXP (inc_src, 1)) >= 0
> Good point. Done
>>
>> @@ -492,7 +519,8 @@ doloop_modify (class loop *loop, class niter_desc 
>> *desc,
>>      case GE:
>>        /* Currently only GE tests against zero are supported.  */
>>        gcc_assert (XEXP (condition, 1) == const0_rtx);
>> -
>> +      /* FALLTHRU */
>> +    case GTU:
>>        noloop = constm1_rtx;
>>
>> I spent a very long time staring at this, trying to understand why
>> noloop = constm1_rtx for GTU when I thought it should have been (count
>> & (n-1)). For the current use of doloop it doesn't matter, because ARM
>> is the only target using it and you set desc->noloop_assumptions to
>> null_rtx in 'arm_attempt_dlstp_transform', so noloop is never used.
>> However, if a different target accepts this GTU pattern, then this
>> target-agnostic code will do the wrong thing.  I suggest we either:
>>  - set noloop to what we think might be the correct value, which if
>> you ask me should be 'count & (XEXP (condition, 1))',
>>  - or add a gcc_assert (GET_CODE (condition) != GTU); under the if
>> (desc->noloop_assumptions) part and document why.  I have a slight
>> preference for the assert, given that otherwise we are adding code we
>> can't test.
> 
> Yeah, that's true, tbh. I've done the latter, but also separated out the
> "case GTU:" and added a comment, so that it's clearer that the noloop
> values aren't used in the only implemented GTU case (Arm)
> 
> Thank you :)
> 
>>
>> LGTM otherwise (but I don't have the power to approve this ;)).
>>
>> Kind regards,
>> Andre
>> ________________________________________
>> From: Stamatis Markianos-Wright <stam.markianos-wright@arm.com>
>> Sent: Thursday, November 16, 2023 11:36 AM
>> To: Stamatis Markianos-Wright via Gcc-patches; Richard Earnshaw; 
>> Richard Sandiford; Kyrylo Tkachov
>> Subject: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated 
>> Low Overhead Loops
>>
>> Pinging back to the top of reviewers' inboxes due to worry about Stage 1
>> End in a few days :)
>>
>>
>> See the last email for the latest version of the 2/2 patch. The 1/2
>> patch is A-Ok from Kyrill's earlier target-backend review.
>>
>>
>> On 10/11/2023 12:41, Stamatis Markianos-Wright wrote:
>>>
>>> On 06/11/2023 17:29, Stamatis Markianos-Wright wrote:
>>>>
>>>> On 06/11/2023 11:24, Richard Sandiford wrote:
>>>>> Stamatis Markianos-Wright <stam.markianos-wright@arm.com> writes:
>>>>>>> One of the main reasons for reading the arm bits was to try to 
>>>>>>> answer
>>>>>>> the question: if we switch to a downcounting loop with a GE
>>>>>>> condition,
>>>>>>> how do we make sure that the start value is not a large unsigned
>>>>>>> number that is interpreted as negative by GE?  E.g. if the loop
>>>>>>> originally counted up in steps of N and used an LTU condition,
>>>>>>> it could stop at a value in the range [INT_MAX + 1, UINT_MAX].
>>>>>>> But the loop might never iterate if we start counting down from
>>>>>>> most values in that range.
>>>>>>>
>>>>>>> Does the patch handle that?
>>>>>> So AFAICT this is actually handled in the generic code in
>>>>>> `doloop_valid_p`:
>>>>>>
>>>>>> These kinds of loops fail because they are "desc->infinite", so no
>>>>>> loop-doloop conversion is attempted at all (even for standard
>>>>>> dls/le loops)
>>>>>>
>>>>>> Thanks to that check I haven't been able to trigger anything like the
>>>>>> behaviour you describe, do you think the doloop_valid_p checks are
>>>>>> robust enough?
>>>>> The loops I was thinking of are provably not infinite though. E.g.:
>>>>>
>>>>>    for (unsigned int i = 0; i < UINT_MAX - 100; ++i)
>>>>>      ...
>>>>>
>>>>> is known to terminate.  And doloop conversion is safe with the normal
>>>>> count-down-by-1 approach, so I don't think current code would need
>>>>> to reject it.  I.e. a conversion to:
>>>>>
>>>>>    unsigned int i = UINT_MAX - 101;
>>>>>    do
>>>>>      ...
>>>>>    while (--i != ~0U);
>>>>>
>>>>> would be safe, but a conversion to:
>>>>>
>>>>>    int i = UINT_MAX - 101;
>>>>>    do
>>>>>      ...
>>>>>    while ((i -= step, i > 0));
>>>>>
>>>>> wouldn't, because the loop body would only be executed once.
>>>>>
>>>>> I'm only going off the name "infinite" though :)  It's possible that
>>>>> it has more connotations than that.
>>>>>
>>>>> Thanks,
>>>>> Richard
>>>>
>>>> Ack, yep, I see what you mean now, and yep, that kind of loop does
>>>> indeed pass through doloop_valid_p
>>>>
>>>> Interestingly, in the v8-M Arm ARM this is done with:
>>>>
>>>> ```
>>>>
>>>> boolean IsLastLowOverheadLoop(INSTR_EXEC_STATE_Type state)
>>>> // This does not check whether a loop is currently active.
>>>> // If the PE were in a loop, would this be the last one?
>>>> return UInt(state.LoopCount) <= (1 << (4 - LTPSIZE));
>>>>
>>>> ```
>>>>
>>>> So architecturally the asm we output would be OK (except maybe the
>>>> "branch too far" subs;bgt;lctp fallback at
>>>> `predicated_doloop_end_internal`, where maybe that should be `bhi`).
>>>> But now GE isn't looking like an accurate representation of this
>>>> operation in the compiler.
>>>>
>>>> I'm wondering if I should try to make
>>>> `predicated_doloop_end_internal` contain a comparison along the lines
>>>> of:
>>>> (gtu: (plus: (LR) (const_int -num_lanes)) (const_int 
>>>> num_lanes_minus_1))
>>>>
>>>> I'll give that a try :)
>>>>
>>>> The only reason I'd chosen to go with GE earlier, tbh, was because of
>>>> the existing handling of GE in loop-doloop.cc
>>>>
>>>> Let me know if any other ideas come to your mind!
>>>>
>>>>
>>>> Cheers,
>>>>
>>>> Stam
>>>
>>>
>>> It looks like I've had success with the below (diff to the previous
>>> patch), trimmed a bit to only the functionally interesting things:
>>>
>>>
>>>
>>>
>>> diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
>>> index 368d5138ca1..54dd4ee564b 100644
>>> --- a/gcc/config/arm/thumb2.md
>>> +++ b/gcc/config/arm/thumb2.md
>>> @@ -1649,16 +1649,28 @@
>>>            && (decrement_num = arm_attempt_dlstp_transform 
>>> (operands[1]))
>>>            && (INTVAL (decrement_num) != 1))
>>>          {
>>> -          insn = emit_insn
>>> -              (gen_thumb2_addsi3_compare0
>>> -              (s0, s0, GEN_INT ((-1) * (INTVAL (decrement_num)))));
>>> -          cmp = XVECEXP (PATTERN (insn), 0, 0);
>>> -          cc_reg = SET_DEST (cmp);
>>> -          bcomp = gen_rtx_GE (VOIDmode, cc_reg, const0_rtx);
>>>            loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
>>> -          emit_jump_insn (gen_rtx_SET (pc_rtx,
>>> -                       gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
>>> -                                 loc_ref, pc_rtx)));
>>> +          switch (INTVAL (decrement_num))
>>> +        {
>>> +          case 2:
>>> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal2
>>> +                        (s0, loc_ref));
>>> +            break;
>>> +          case 4:
>>> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal4
>>> +                        (s0, loc_ref));
>>> +            break;
>>> +          case 8:
>>> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal8
>>> +                        (s0, loc_ref));
>>> +            break;
>>> +          case 16:
>>> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal16
>>> +                        (s0, loc_ref));
>>> +            break;
>>> +          default:
>>> +            gcc_unreachable ();
>>> +        }
>>>            DONE;
>>>          }
>>>      }
>>>
>>> diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
>>> index 93905583b18..c083f965fa9 100644
>>> --- a/gcc/config/arm/mve.md
>>> +++ b/gcc/config/arm/mve.md
>>> @@ -6922,23 +6922,24 @@
>>>  ;; Originally expanded by 'predicated_doloop_end'.
>>>  ;; In the rare situation where the branch is too far, we do also 
>>> need to
>>>  ;; revert FPSCR.LTPSIZE back to 0x100 after the last iteration.
>>> -(define_insn "*predicated_doloop_end_internal"
>>> +(define_insn "predicated_doloop_end_internal<letp_num_lanes>"
>>>    [(set (pc)
>>>      (if_then_else
>>> -       (ge (plus:SI (reg:SI LR_REGNUM)
>>> -            (match_operand:SI 0 "const_int_operand" ""))
>>> -        (const_int 0))
>>> -     (label_ref (match_operand 1 "" ""))
>>> +       (gtu (unspec:SI [(plus:SI (match_operand:SI 0
>>> "s_register_operand" "=r")
>>> +                     (const_int <letp_num_lanes_neg>))]
>>> +        LETP)
>>> +        (const_int <letp_num_lanes_minus_1>))
>>> +     (match_operand 1 "" "")
>>>       (pc)))
>>> -   (set (reg:SI LR_REGNUM)
>>> -    (plus:SI (reg:SI LR_REGNUM) (match_dup 0)))
>>> +   (set (match_dup 0)
>>> +    (plus:SI (match_dup 0) (const_int <letp_num_lanes_neg>)))
>>>     (clobber (reg:CC CC_REGNUM))]
>>>    "TARGET_HAVE_MVE"
>>>    {
>>>      if (get_attr_length (insn) == 4)
>>>        return "letp\t%|lr, %l1";
>>>      else
>>> -      return "subs\t%|lr, #%n0\n\tbgt\t%l1\n\tlctp";
>>> +      return "subs\t%|lr, #<letp_num_lanes>\n\tbhi\t%l1\n\tlctp";
>>>    }
>>>    [(set (attr "length")
>>>      (if_then_else
>>> @@ -6947,11 +6948,11 @@
>>>          (const_int 6)))
>>>     (set_attr "type" "branch")])
>>>
>>> -(define_insn "dlstp<mode1>_insn"
>>> +(define_insn "dlstp<dlstp_elemsize>_insn"
>>>    [
>>>      (set (reg:SI LR_REGNUM)
>>>       (unspec:SI [(match_operand:SI 0 "s_register_operand" "r")]
>>>        DLSTP))
>>>    ]
>>>    "TARGET_HAVE_MVE"
>>> -  "dlstp.<mode1>\t%|lr, %0")
>>> +  "dlstp.<dlstp_elemsize>\t%|lr, %0")
>>>
>>> diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
>>> index 6a72700a127..47fdef989b4 100644
>>> --- a/gcc/loop-doloop.cc
>>> +++ b/gcc/loop-doloop.cc
>>> @@ -185,6 +185,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>>        || XEXP (inc_src, 0) != reg
>>>        || !CONST_INT_P (XEXP (inc_src, 1)))
>>>      return 0;
>>> +  int dec_num = abs (INTVAL (XEXP (inc_src, 1)));
>>>
>>>    /* Check for (set (pc) (if_then_else (condition)
>>>                                         (label_ref (label))
>>> @@ -199,21 +200,32 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>>    /* Extract loop termination condition.  */
>>>    condition = XEXP (SET_SRC (cmp), 0);
>>>
>>> -  /* We expect a GE or NE comparison with 0 or 1.  */
>>> -  if ((GET_CODE (condition) != GE
>>> -       && GET_CODE (condition) != NE)
>>> -      || (XEXP (condition, 1) != const0_rtx
>>> -          && XEXP (condition, 1) != const1_rtx))
>>> +  /* We expect a GE or NE comparison with 0 or 1, or a GTU comparison
>>> with
>>> +     dec_num - 1.  */
>>> +  if (!((GET_CODE (condition) == GE
>>> +     || GET_CODE (condition) == NE)
>>> +    && (XEXP (condition, 1) == const0_rtx
>>> +        || XEXP (condition, 1) == const1_rtx ))
>>> +      &&!(GET_CODE (condition) == GTU
>>> +      && ((INTVAL (XEXP (condition, 1))) == (dec_num - 1))))
>>>      return 0;
>>>
>>> -  if ((XEXP (condition, 0) == reg)
>>> +  /* For the ARM special case of having a GTU: re-form the condition
>>> without
>>> +     the unspec for the benefit of the middle-end.  */
>>> +  if (GET_CODE (condition) == GTU)
>>> +    {
>>> +      condition = gen_rtx_fmt_ee (GTU, VOIDmode, inc_src, GEN_INT
>>> (dec_num - 1));
>>> +      return condition;
>>> +    }
>>> +  else if ((XEXP (condition, 0) == reg)
>>>        /* For the third case:  */
>>>        || ((cc_reg != NULL_RTX)
>>>        && (XEXP (condition, 0) == cc_reg)
>>>        && (reg_orig == reg))
>>> @@ -245,20 +257,11 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>>                         (label_ref (label))
>>>                         (pc))))])
>>>
>>> -    So we return the second form instead for the two cases when n == 1.
>>> -
>>> -    For n > 1, the final value may be exceeded, so use GE instead of 
>>> NE.
>>> +    So we return the second form instead for the two cases.
>>>       */
>>> -     if (GET_CODE (pattern) != PARALLEL)
>>> -       {
>>> -    if (INTVAL (XEXP (inc_src, 1)) != -1)
>>> -      condition = gen_rtx_fmt_ee (GE, VOIDmode, inc_src, const0_rtx);
>>> -    else
>>> -      condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);;
>>> -       }
>>> -
>>> +    condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
>>>      return condition;
>>> -   }
>>> +    }
>>>
>>>    /* ??? If a machine uses a funny comparison, we could return a
>>>       canonicalized form here.  */
>>> @@ -501,7 +504,8 @@ doloop_modify (class loop *loop, class niter_desc
>>> *desc,
>>>      case GE:
>>>        /* Currently only GE tests against zero are supported. */
>>>        gcc_assert (XEXP (condition, 1) == const0_rtx);
>>> -
>>> +      /* FALLTHRU */
>>> +    case GTU:
>>>        noloop = constm1_rtx;
>>> diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
>>> index a6a7ff507a5..9398702cddd 100644
>>> --- a/gcc/config/arm/iterators.md
>>> +++ b/gcc/config/arm/iterators.md
>>> @@ -2673,8 +2673,16 @@
>>>  (define_int_attr mrrc [(VUNSPEC_MRRC "mrrc") (VUNSPEC_MRRC2 "mrrc2")])
>>>  (define_int_attr MRRC [(VUNSPEC_MRRC "MRRC") (VUNSPEC_MRRC2 "MRRC2")])
>>>
>>> -(define_int_attr mode1 [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
>>> -            (DLSTP64 "64")])
>>> +(define_int_attr dlstp_elemsize [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32
>>> "32")
>>> +                 (DLSTP64 "64")])
>>> +
>>> +(define_int_attr letp_num_lanes [(LETP8 "16") (LETP16 "8") (LETP32 "4")
>>> +                 (LETP64 "2")])
>>> +(define_int_attr letp_num_lanes_neg [(LETP8 "-16") (LETP16 "-8")
>>> (LETP32 "-4")
>>> +                     (LETP64 "-2")])
>>> +
>>> +(define_int_attr letp_num_lanes_minus_1 [(LETP8 "15") (LETP16 "7")
>>> (LETP32 "3")
>>> +                     (LETP64 "1")])
>>>
>>>  (define_int_attr opsuffix [(UNSPEC_DOT_S "s8")
>>>                 (UNSPEC_DOT_U "u8")
>>> @@ -2921,6 +2929,8 @@
>>>  (define_int_iterator VQSHLUQ_N [VQSHLUQ_N_S])
>>>  (define_int_iterator DLSTP [DLSTP8 DLSTP16 DLSTP32
>>>                     DLSTP64])
>>> +(define_int_iterator LETP [LETP8 LETP16 LETP32
>>> +               LETP64])
>>>
>>>  ;; Define iterators for VCMLA operations
>>>  (define_int_iterator VCMLA_OP [UNSPEC_VCMLA
>>>        /* The iteration count does not need incrementing for a GE
>>> test.  */
>>> diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
>>> index 12ae4c4f820..2d6f27c14f4 100644
>>> --- a/gcc/config/arm/unspecs.md
>>> +++ b/gcc/config/arm/unspecs.md
>>> @@ -587,6 +587,10 @@
>>>    DLSTP16
>>>    DLSTP32
>>>    DLSTP64
>>> +  LETP8
>>> +  LETP16
>>> +  LETP32
>>> +  LETP64
>>>    VPNOT
>>>    VCREATEQ_F
>>>    VCVTQ_N_TO_F_S
>>>
>>>
>>> I've attached the whole [2/2] patch diff with this change and
>>> the required comment changes in doloop_condition_get.
>>> WDYT?
>>>
>>>
>>> Thanks,
>>>
>>> Stam
>>>
>>>
>>>>
>>>>

[I'm still working through this patch, but there are a number of things 
which clearly need addressing, so I'll stop at this point today.]

+arm_predict_doloop_p (struct loop *loop)
+{

Is it feasible to add something here to check the overall size of the 
loop, so that we don't try to convert loops that are clearly too big?

+static rtx_insn*
+arm_mve_dlstp_check_inc_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
...
+  rtx condcount_reg_set = PATTERN (DF_REF_INSN (condcount_reg_set_df));
+  rtx_insn* vctp_reg_set = DF_REF_INSN (vctp_reg_set_df);
+  /* Ensure the modification of the vctp reg from df is consistent with
+     the iv and the number of lanes on the vctp insn.  */
+  if (!(GET_CODE (XEXP (PATTERN (vctp_reg_set), 1)) == PLUS
+      && REGNO (XEXP (PATTERN (vctp_reg_set), 0))
+	  == REGNO (XEXP (XEXP (PATTERN (vctp_reg_set), 1), 0))))

You seem to be assuming that the pattern in the insn you've found will 
always be of the form (set x y), but that's unsafe.  When scanning RTL 
you must first check that you really do have a genuine single set.  The 
easiest way to do that is to call single_set (insn), which returns the 
SET rtl if the insn is a single set and NULL otherwise; it can also 
handle irrelevant clobbers that might be added for register-allocation 
purposes.  Also, rather than using XEXP, you should use SET_SRC and 
SET_DEST when looking at the arms of a SET operation, for better clarity.

+	    if (!single_set (last_set_insn))
+	      return NULL;
+	    rtx counter_orig_set;
+	    counter_orig_set = XEXP (PATTERN (last_set_insn), 1);

There's a similar problem here, in that single_set returns a valid value 
for something like
   (parallel [(set a b)
              (clobber scratch_reg)])

so looking directly at the pattern of the insn is wrong.  Instead you 
should use the value returned by single_set for further analysis.

+  /* Next, ensure that it is a PLUS of the form:
+     (set (reg a) (plus (reg a) (const_int)))
+     where (reg a) is the same as condcount.  */
+  if (GET_CODE (XEXP (PATTERN (dec_insn), 1)) == PLUS
+      && REGNO (XEXP (PATTERN (dec_insn), 0))
+	  == REGNO (XEXP (XEXP (PATTERN (dec_insn), 1), 0))
+      && REGNO (XEXP (PATTERN (dec_insn), 0)) == REGNO (condcount))

and again, you need to validate that dec_insn is a set first.  There are 
several other cases where you need to check for a SET as well, but I 
won't mention any more here.

Can this code be run before register allocation? If so, there's a risk 
that we will have different pseudos for the source and dest operands 
here, but expect register allocation to tie them back together; 
something like

t1 = count
loop_head:
   ...
   t2 = t1
   ...
   t1 = t2 + const
   if (t1 < end)
      goto loop_head;

Register allocation would be expected to eliminate the t2 = t1 insn by 
allocating the same physical register here.


+    decrementnum = abs (INTVAL (XEXP (XEXP (PATTERN (dec_insn), 1), 1)));

Why the use of abs ()?  I don't see where we validate that the direction 
really matches our expectation.

+      if (abs (INTVAL (vctp_reg_iv.step))
+	  != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))

And again, we only look at the absolute value.  Shouldn't we be 
validating the sign here against the sign we ignored earlier?

+  if (!(TARGET_32BIT && TARGET_HAVE_LOB && optimize > 0))

Again, the test for TARGET_32BIT seems redundant.

+  if (single_set (next_use1)
+      && GET_CODE (SET_SRC (single_set (next_use1))) == ZERO_EXTEND)

Don't call single_set twice, just save the result of the first call and 
use that.  Perhaps
   rtx next_use1_set = single_set (next_use1);
   if (next_use1_set && ...)

+/* Attempt to transform the loop contents of loop basic block from VPT
+   predicated insns into unpredicated insns for a dlstp/letp loop.  */
+
+rtx
+arm_attempt_dlstp_transform (rtx label)

Please describe what is returned in the comment.  I'm guessing it's some 
form of iteration count, but why return 1 on failure?

+    return GEN_INT (1);

It's more efficient to write "return const1_rtx;".  In fact, it looks 
like this function always returns a CONST_INT and is only ever called 
from one place in thumb2.md, where we only ever look at the integer 
value.  So why not make it return a HOST_WIDE_INT in the first place?


+static bool
+arm_emit_mve_unpredicated_insn_to_seq (rtx_insn* insn)
+{
+
I think this function needs to also copy across the INSN_LOCATION 
information from the insn it is rewriting; that way we keep any 
diagnostic information and debug information more accurate.  Something like:

+    }
+  INSN_LOCATION (new_insn) = INSN_LOCATION (insn);
+  return true;

That will be enough if we never get more than one insn emitted from the 
emit_insn sequence above (emit_insn returns new_insn, which you'll need 
to save).  If you need something more complex, then perhaps you could use
   emit_insn_after_setloc (GEN_FCN..., get_last_insn (),
			  INSN_LOCATION (insn));

+  for (insn = seq; NEXT_INSN (insn); insn = NEXT_INSN (insn))
+    if (NOTE_P (insn))
+      emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
+    else if (DEBUG_INSN_P (insn))
+      emit_debug_insn_after (PATTERN (insn), BB_END (body));
+    else
+      emit_insn_after (PATTERN (insn), BB_END (body));
+
+

I'm not sure I follow why you can't replace this entire loop with

   emit_insn_after (seq, BB_END (body));

which should do all of the above for you.  But there's another problem here:

+    if (NOTE_P (insn))
+      emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));

Notes have data (see emit_note_copy in emit-rtl.cc).  If you can't use 
that function, you'll need to copy the note data manually in the same 
way as emit_note_copy does (or add emit_note_copy_after() as a new 
function in emit-rtl.cc).

I also note that you're already copying the note in 
arm_attempt_dlstp_transform and that also drops the note's data, but 
that really can use emit_note_copy().

thumb2.md:
+	  /* If we have a compatibe MVE target, try and analyse the loop

Typo: compatible

+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
@@ -0,0 +1,69 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+

...

+/* { dg-final { scan-assembler-times {\tdlstp.16} 1 } } */

Please do not mix scan-assembler tests with execution tests.  The latter 
require specific hardware to run, while the former do not; mixing them 
means these tests get run by far fewer configurations of the compiler.
