[PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

Richard Earnshaw Richard.Earnshaw@foss.arm.com
Tue Dec 12 17:56:22 GMT 2023



On 30/11/2023 12:55, Stamatis Markianos-Wright wrote:
> Hi Andre,
> 
> Thanks for the comments, see latest revision attached.
> 
> On 27/11/2023 12:47, Andre Vieira (lists) wrote:
>> Hi Stam,
>>
>> Just some comments.
>>
>> +/* Recursively scan through the DF chain backwards within the basic block and
>> +   determine if any of the USEs of the original insn (or the USEs of the insns
>> +   where they were DEF-ed, etc., recursively) were affected by implicit VPT
>>
>> s/Recursively scan/Scan/ as you no longer recurse (thanks for that, by
>> the way :)).  Remove "recursively" too, for the same reason.
>>
>> +      if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P 
>> (cond_temp_iv.step))
>> +    return NULL;
>> +      /* Look at the steps and swap around the rtx's if needed. Error 
>> out if
>> +     one of them cannot be identified as constant.  */
>> +      if (INTVAL (cond_counter_iv.step) != 0 && INTVAL 
>> (cond_temp_iv.step) != 0)
>> +    return NULL;
>>
>> Move this comment above the preceding if statement; that is where the
>> erroring-out it describes actually happens.
> Done
>>
>> +      emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
>> Add a space after the '(enum insn_note)' cast.
>>
>> @@ -173,14 +176,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>    if (! REG_P (reg))
>>      return 0;
>>  -  /* Check if something = (plus (reg) (const_int -1)).
>> +  /* Check if something = (plus (reg) (const_int -n)).
>>       On IA-64, this decrement is wrapped in an if_then_else.  */
>>    inc_src = SET_SRC (inc);
>>    if (GET_CODE (inc_src) == IF_THEN_ELSE)
>>      inc_src = XEXP (inc_src, 1);
>>    if (GET_CODE (inc_src) != PLUS
>>        || XEXP (inc_src, 0) != reg
>> -      || XEXP (inc_src, 1) != constm1_rtx)
>> +      || !CONST_INT_P (XEXP (inc_src, 1)))
>>
>> Do we ever check that inc_src is negative? We used to check that it was
>> -1; now we only check that it's a constant, not a negative one, so I
>> suspect this needs a:
>> || INTVAL (XEXP (inc_src, 1)) >= 0
> Good point. Done
>>
>> @@ -492,7 +519,8 @@ doloop_modify (class loop *loop, class niter_desc 
>> *desc,
>>      case GE:
>>        /* Currently only GE tests against zero are supported.  */
>>        gcc_assert (XEXP (condition, 1) == const0_rtx);
>> -
>> +      /* FALLTHRU */
>> +    case GTU:
>>        noloop = constm1_rtx;
>>
>> I spent a very long time staring at this, trying to understand why
>> noloop = constm1_rtx for GTU when I thought it should have been (count
>> & (n-1)). For the current use of doloop it doesn't matter, because ARM
>> is the only target using it and you set desc->noloop_assumptions to
>> null_rtx in 'arm_attempt_dlstp_transform', so noloop is never used.
>> However, if a different target accepts this GTU pattern, then this
>> target-agnostic code will do the wrong thing.  I suggest we either:
>>  - set noloop to what we think might be the correct value, which if
>> you ask me should be 'count & (XEXP (condition, 1))',
>>  - or add a gcc_assert (GET_CODE (condition) != GTU); under the if
>> (desc->noloop_assumptions) part and document why.  I have a slight
>> preference for the assert, given that otherwise we are adding code we
>> can't test.
> 
> Yeah, that's true, tbh. I've done the latter, but also separated out the
> "case GTU:" and added a comment, so that it's clearer that the noloop
> values aren't used in the only implemented GTU case (Arm)
> 
> Thank you :)
> 
>>
>> LGTM otherwise (but I don't have the power to approve this ;)).
>>
>> Kind regards,
>> Andre
>> ________________________________________
>> From: Stamatis Markianos-Wright <stam.markianos-wright@arm.com>
>> Sent: Thursday, November 16, 2023 11:36 AM
>> To: Stamatis Markianos-Wright via Gcc-patches; Richard Earnshaw; 
>> Richard Sandiford; Kyrylo Tkachov
>> Subject: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated 
>> Low Overhead Loops
>>
>> Pinging back to the top of reviewers' inboxes due to worry about Stage 1
>> End in a few days :)
>>
>>
>> See the last email for the latest version of the 2/2 patch. The 1/2
>> patch is A-Ok from Kyrill's earlier target-backend review.
>>
>>
>> On 10/11/2023 12:41, Stamatis Markianos-Wright wrote:
>>>
>>> On 06/11/2023 17:29, Stamatis Markianos-Wright wrote:
>>>>
>>>> On 06/11/2023 11:24, Richard Sandiford wrote:
>>>>> Stamatis Markianos-Wright <stam.markianos-wright@arm.com> writes:
>>>>>>> One of the main reasons for reading the arm bits was to try to 
>>>>>>> answer
>>>>>>> the question: if we switch to a downcounting loop with a GE
>>>>>>> condition,
>>>>>>> how do we make sure that the start value is not a large unsigned
>>>>>>> number that is interpreted as negative by GE?  E.g. if the loop
>>>>>>> originally counted up in steps of N and used an LTU condition,
>>>>>>> it could stop at a value in the range [INT_MAX + 1, UINT_MAX].
>>>>>>> But the loop might never iterate if we start counting down from
>>>>>>> most values in that range.
>>>>>>>
>>>>>>> Does the patch handle that?
>>>>>> So AFAICT this is actually handled in the generic code in
>>>>>> `doloop_valid_p`:
>>>>>>
>>>>>> These kinds of loops fail because they are "desc->infinite", so no
>>>>>> loop-doloop conversion is attempted at all (even for standard
>>>>>> dls/le loops)
>>>>>>
>>>>>> Thanks to that check I haven't been able to trigger anything like the
>>>>>> behaviour you describe, do you think the doloop_valid_p checks are
>>>>>> robust enough?
>>>>> The loops I was thinking of are provably not infinite though. E.g.:
>>>>>
>>>>>    for (unsigned int i = 0; i < UINT_MAX - 100; ++i)
>>>>>      ...
>>>>>
>>>>> is known to terminate.  And doloop conversion is safe with the normal
>>>>> count-down-by-1 approach, so I don't think current code would need
>>>>> to reject it.  I.e. a conversion to:
>>>>>
>>>>>    unsigned int i = UINT_MAX - 101;
>>>>>    do
>>>>>      ...
>>>>>    while (--i != ~0U);
>>>>>
>>>>> would be safe, but a conversion to:
>>>>>
>>>>>    int i = UINT_MAX - 101;
>>>>>    do
>>>>>      ...
>>>>>    while ((i -= step, i > 0));
>>>>>
>>>>> wouldn't, because the loop body would only be executed once.
>>>>>
>>>>> I'm only going off the name "infinite" though :)  It's possible that
>>>>> it has more connotations than that.
>>>>>
>>>>> Thanks,
>>>>> Richard
>>>>
>>>> Ack, yep, I see what you mean now, and yep, that kind of loop does
>>>> indeed pass through doloop_valid_p
>>>>
>>>> Interestingly, in the v8-M Arm ARM this is done with:
>>>>
>>>> ```
>>>>
>>>> boolean IsLastLowOverheadLoop(INSTR_EXEC_STATE_Type state)
>>>> // This does not check whether a loop is currently active.
>>>> // If the PE were in a loop, would this be the last one?
>>>> return UInt(state.LoopCount) <= (1 << (4 - LTPSIZE));
>>>>
>>>> ```
>>>>
>>>> So architecturally the asm we output would be OK (except maybe the
>>>> "branch too far" subs;bgt;lctp fallback at
>>>> `predicated_doloop_end_internal`, where maybe that should be `bhi`).
>>>> But now GE isn't looking like an accurate representation of this
>>>> operation in the compiler.
>>>>
>>>> I'm wondering if I should try to make
>>>> `predicated_doloop_end_internal` contain a comparison along the lines
>>>> of:
>>>> (gtu: (plus: (LR) (const_int -num_lanes)) (const_int 
>>>> num_lanes_minus_1))
>>>>
>>>> I'll give that a try :)
>>>>
>>>> The only reason I'd chosen to go with GE earlier, tbh, was because of
>>>> the existing handling of GE in loop-doloop.cc
>>>>
>>>> Let me know if any other ideas come to your mind!
>>>>
>>>>
>>>> Cheers,
>>>>
>>>> Stam
>>>
>>>
>>> It looks like I've had success with the below (diff to the previous
>>> patch), trimmed a bit to only the functionally interesting things:
>>>
>>>
>>>
>>>
>>> diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
>>> index 368d5138ca1..54dd4ee564b 100644
>>> --- a/gcc/config/arm/thumb2.md
>>> +++ b/gcc/config/arm/thumb2.md
>>> @@ -1649,16 +1649,28 @@
>>>            && (decrement_num = arm_attempt_dlstp_transform 
>>> (operands[1]))
>>>            && (INTVAL (decrement_num) != 1))
>>>          {
>>> -          insn = emit_insn
>>> -              (gen_thumb2_addsi3_compare0
>>> -              (s0, s0, GEN_INT ((-1) * (INTVAL (decrement_num)))));
>>> -          cmp = XVECEXP (PATTERN (insn), 0, 0);
>>> -          cc_reg = SET_DEST (cmp);
>>> -          bcomp = gen_rtx_GE (VOIDmode, cc_reg, const0_rtx);
>>>            loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
>>> -          emit_jump_insn (gen_rtx_SET (pc_rtx,
>>> -                       gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
>>> -                                 loc_ref, pc_rtx)));
>>> +          switch (INTVAL (decrement_num))
>>> +        {
>>> +          case 2:
>>> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal2
>>> +                        (s0, loc_ref));
>>> +            break;
>>> +          case 4:
>>> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal4
>>> +                        (s0, loc_ref));
>>> +            break;
>>> +          case 8:
>>> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal8
>>> +                        (s0, loc_ref));
>>> +            break;
>>> +          case 16:
>>> +            insn = emit_jump_insn (gen_predicated_doloop_end_internal16
>>> +                        (s0, loc_ref));
>>> +            break;
>>> +          default:
>>> +            gcc_unreachable ();
>>> +        }
>>>            DONE;
>>>          }
>>>      }
>>>
>>> diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
>>> index 93905583b18..c083f965fa9 100644
>>> --- a/gcc/config/arm/mve.md
>>> +++ b/gcc/config/arm/mve.md
>>> @@ -6922,23 +6922,24 @@
>>>  ;; Originally expanded by 'predicated_doloop_end'.
>>>  ;; In the rare situation where the branch is too far, we do also 
>>> need to
>>>  ;; revert FPSCR.LTPSIZE back to 0x100 after the last iteration.
>>> -(define_insn "*predicated_doloop_end_internal"
>>> +(define_insn "predicated_doloop_end_internal<letp_num_lanes>"
>>>    [(set (pc)
>>>      (if_then_else
>>> -       (ge (plus:SI (reg:SI LR_REGNUM)
>>> -            (match_operand:SI 0 "const_int_operand" ""))
>>> -        (const_int 0))
>>> -     (label_ref (match_operand 1 "" ""))
>>> +       (gtu (unspec:SI [(plus:SI (match_operand:SI 0
>>> "s_register_operand" "=r")
>>> +                     (const_int <letp_num_lanes_neg>))]
>>> +        LETP)
>>> +        (const_int <letp_num_lanes_minus_1>))
>>> +     (match_operand 1 "" "")
>>>       (pc)))
>>> -   (set (reg:SI LR_REGNUM)
>>> -    (plus:SI (reg:SI LR_REGNUM) (match_dup 0)))
>>> +   (set (match_dup 0)
>>> +    (plus:SI (match_dup 0) (const_int <letp_num_lanes_neg>)))
>>>     (clobber (reg:CC CC_REGNUM))]
>>>    "TARGET_HAVE_MVE"
>>>    {
>>>      if (get_attr_length (insn) == 4)
>>>        return "letp\t%|lr, %l1";
>>>      else
>>> -      return "subs\t%|lr, #%n0\n\tbgt\t%l1\n\tlctp";
>>> +      return "subs\t%|lr, #<letp_num_lanes>\n\tbhi\t%l1\n\tlctp";
>>>    }
>>>    [(set (attr "length")
>>>      (if_then_else
>>> @@ -6947,11 +6948,11 @@
>>>          (const_int 6)))
>>>     (set_attr "type" "branch")])
>>>
>>> -(define_insn "dlstp<mode1>_insn"
>>> +(define_insn "dlstp<dlstp_elemsize>_insn"
>>>    [
>>>      (set (reg:SI LR_REGNUM)
>>>       (unspec:SI [(match_operand:SI 0 "s_register_operand" "r")]
>>>        DLSTP))
>>>    ]
>>>    "TARGET_HAVE_MVE"
>>> -  "dlstp.<mode1>\t%|lr, %0")
>>> +  "dlstp.<dlstp_elemsize>\t%|lr, %0")
>>>
>>> diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
>>> index 6a72700a127..47fdef989b4 100644
>>> --- a/gcc/loop-doloop.cc
>>> +++ b/gcc/loop-doloop.cc
>>> @@ -185,6 +185,7 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>>        || XEXP (inc_src, 0) != reg
>>>        || !CONST_INT_P (XEXP (inc_src, 1)))
>>>      return 0;
>>> +  int dec_num = abs (INTVAL (XEXP (inc_src, 1)));
>>>
>>>    /* Check for (set (pc) (if_then_else (condition)
>>>                                         (label_ref (label))
>>> @@ -199,21 +200,32 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>>    /* Extract loop termination condition.  */
>>>    condition = XEXP (SET_SRC (cmp), 0);
>>>
>>> -  /* We expect a GE or NE comparison with 0 or 1.  */
>>> -  if ((GET_CODE (condition) != GE
>>> -       && GET_CODE (condition) != NE)
>>> -      || (XEXP (condition, 1) != const0_rtx
>>> -          && XEXP (condition, 1) != const1_rtx))
>>> +  /* We expect a GE or NE comparison with 0 or 1, or a GTU comparison
>>> with
>>> +     dec_num - 1.  */
>>> +  if (!((GET_CODE (condition) == GE
>>> +     || GET_CODE (condition) == NE)
>>> +    && (XEXP (condition, 1) == const0_rtx
>>> +        || XEXP (condition, 1) == const1_rtx ))
>>> +      &&!(GET_CODE (condition) == GTU
>>> +      && ((INTVAL (XEXP (condition, 1))) == (dec_num - 1))))
>>>      return 0;
>>>
>>> -  if ((XEXP (condition, 0) == reg)
>>> +  /* For the ARM special case of having a GTU: re-form the condition
>>> without
>>> +     the unspec for the benefit of the middle-end.  */
>>> +  if (GET_CODE (condition) == GTU)
>>> +    {
>>> +      condition = gen_rtx_fmt_ee (GTU, VOIDmode, inc_src, GEN_INT
>>> (dec_num - 1));
>>> +      return condition;
>>> +    }
>>> +  else if ((XEXP (condition, 0) == reg)
>>>        /* For the third case:  */
>>>        || ((cc_reg != NULL_RTX)
>>>        && (XEXP (condition, 0) == cc_reg)
>>>        && (reg_orig == reg))
>>> @@ -245,20 +257,11 @@ doloop_condition_get (rtx_insn *doloop_pat)
>>>                         (label_ref (label))
>>>                         (pc))))])
>>>
>>> -    So we return the second form instead for the two cases when n == 1.
>>> -
>>> -    For n > 1, the final value may be exceeded, so use GE instead of 
>>> NE.
>>> +    So we return the second form instead for the two cases.
>>>       */
>>> -     if (GET_CODE (pattern) != PARALLEL)
>>> -       {
>>> -    if (INTVAL (XEXP (inc_src, 1)) != -1)
>>> -      condition = gen_rtx_fmt_ee (GE, VOIDmode, inc_src, const0_rtx);
>>> -    else
>>> -      condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);;
>>> -       }
>>> -
>>> +    condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
>>>      return condition;
>>> -   }
>>> +    }
>>>
>>>    /* ??? If a machine uses a funny comparison, we could return a
>>>       canonicalized form here.  */
>>> @@ -501,7 +504,8 @@ doloop_modify (class loop *loop, class niter_desc
>>> *desc,
>>>      case GE:
>>>        /* Currently only GE tests against zero are supported. */
>>>        gcc_assert (XEXP (condition, 1) == const0_rtx);
>>> -
>>> +      /* FALLTHRU */
>>> +    case GTU:
>>>        noloop = constm1_rtx;
>>> diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
>>> index a6a7ff507a5..9398702cddd 100644
>>> --- a/gcc/config/arm/iterators.md
>>> +++ b/gcc/config/arm/iterators.md
>>> @@ -2673,8 +2673,16 @@
>>>  (define_int_attr mrrc [(VUNSPEC_MRRC "mrrc") (VUNSPEC_MRRC2 "mrrc2")])
>>>  (define_int_attr MRRC [(VUNSPEC_MRRC "MRRC") (VUNSPEC_MRRC2 "MRRC2")])
>>>
>>> -(define_int_attr mode1 [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32")
>>> -            (DLSTP64 "64")])
>>> +(define_int_attr dlstp_elemsize [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32
>>> "32")
>>> +                 (DLSTP64 "64")])
>>> +
>>> +(define_int_attr letp_num_lanes [(LETP8 "16") (LETP16 "8") (LETP32 "4")
>>> +                 (LETP64 "2")])
>>> +(define_int_attr letp_num_lanes_neg [(LETP8 "-16") (LETP16 "-8")
>>> (LETP32 "-4")
>>> +                     (LETP64 "-2")])
>>> +
>>> +(define_int_attr letp_num_lanes_minus_1 [(LETP8 "15") (LETP16 "7")
>>> (LETP32 "3")
>>> +                     (LETP64 "1")])
>>>
>>>  (define_int_attr opsuffix [(UNSPEC_DOT_S "s8")
>>>                 (UNSPEC_DOT_U "u8")
>>> @@ -2921,6 +2929,8 @@
>>>  (define_int_iterator VQSHLUQ_N [VQSHLUQ_N_S])
>>>  (define_int_iterator DLSTP [DLSTP8 DLSTP16 DLSTP32
>>>                     DLSTP64])
>>> +(define_int_iterator LETP [LETP8 LETP16 LETP32
>>> +               LETP64])
>>>
>>>  ;; Define iterators for VCMLA operations
>>>  (define_int_iterator VCMLA_OP [UNSPEC_VCMLA
>>>        /* The iteration count does not need incrementing for a GE
>>> test.  */
>>> diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
>>> index 12ae4c4f820..2d6f27c14f4 100644
>>> --- a/gcc/config/arm/unspecs.md
>>> +++ b/gcc/config/arm/unspecs.md
>>> @@ -587,6 +587,10 @@
>>>    DLSTP16
>>>    DLSTP32
>>>    DLSTP64
>>> +  LETP8
>>> +  LETP16
>>> +  LETP32
>>> +  LETP64
>>>    VPNOT
>>>    VCREATEQ_F
>>>    VCVTQ_N_TO_F_S
>>>
>>>
>>> I've attached the whole [2/2] patch diff with this change and
>>> the required comment changes in doloop_condition_get.
>>> WDYT?
>>>
>>>
>>> Thanks,
>>>
>>> Stam
>>>
>>>
>>>>
>>>>

[I'm still working through this patch, but there are a number of things 
which clearly need addressing, so I'll stop at this point today.]

+arm_predict_doloop_p (struct loop *loop)
+{

Is it feasible to add something here to check the overall size of the 
loop, so that we don't try to convert loops that are clearly too big?

+static rtx_insn*
+arm_mve_dlstp_check_inc_counter (basic_block body, rtx_insn* vctp_insn,
+				 rtx condconst, rtx condcount)
...
+  rtx condcount_reg_set = PATTERN (DF_REF_INSN (condcount_reg_set_df));
+  rtx_insn* vctp_reg_set = DF_REF_INSN (vctp_reg_set_df);
+  /* Ensure the modification of the vctp reg from df is consistent with
+     the iv and the number of lanes on the vctp insn.  */
+  if (!(GET_CODE (XEXP (PATTERN (vctp_reg_set), 1)) == PLUS
+      && REGNO (XEXP (PATTERN (vctp_reg_set), 0))
+	  == REGNO (XEXP (XEXP (PATTERN (vctp_reg_set), 1), 0))))

You seem to be assuming that the pattern in the insn you've found will 
always be of the form (set x y), but that's unsafe.  When scanning RTL 
you must first check that you really do have a genuine single set.  The 
easiest way to do that is to call single_set (insn), which returns the 
SET rtl if the insn is a single set and NULL otherwise; it can also 
handle irrelevant clobbers that might be added for register-allocation 
purposes.  Also, rather than using XEXP, you should use SET_SRC and 
SET_DEST when looking at the arms of a SET operation, for better clarity.

+	    if (!single_set (last_set_insn))
+	      return NULL;
+	    rtx counter_orig_set;
+	    counter_orig_set = XEXP (PATTERN (last_set_insn), 1);

There's a similar problem here, in that single_set returns a valid value 
for something like
   (parallel [(set a b)
              (clobber scratch_reg)])

so looking directly at the pattern of the insn is wrong.  Instead you 
should use the value returned by single_set for further analysis.

+  /* Next, ensure that it is a PLUS of the form:
+     (set (reg a) (plus (reg a) (const_int)))
+     where (reg a) is the same as condcount.  */
+  if (GET_CODE (XEXP (PATTERN (dec_insn), 1)) == PLUS
+      && REGNO (XEXP (PATTERN (dec_insn), 0))
+	  == REGNO (XEXP (XEXP (PATTERN (dec_insn), 1), 0))
+      && REGNO (XEXP (PATTERN (dec_insn), 0)) == REGNO (condcount))

and again, you need to validate that dec_insn is a set first.  There are 
several other cases where you need to check for a SET as well, but I 
won't mention any more here.

Can this code be run before register allocation? If so, there's a risk 
that we will have different pseudos for the source and dest operands 
here, but expect register allocation to tie them back together; 
something like

t1 = count
loop_head:
   ...
   t2 = t1
   ...
   t1 = t2 + const
   if (t1 < end)
      goto loop_head;

Register allocation would be expected to eliminate the t2 = t1 insn by 
allocating the same physical register here.


+    decrementnum = abs (INTVAL (XEXP (XEXP (PATTERN (dec_insn), 1), 1)));

Why the use of abs ()?  I don't see where we validate that the direction 
really matches our expectation.

+      if (abs (INTVAL (vctp_reg_iv.step))
+	  != arm_mve_get_vctp_lanes (PATTERN (vctp_insn)))

And again, we only look at the absolute value.  Shouldn't we be 
validating the sign here against the sign we ignored earlier?

+  if (!(TARGET_32BIT && TARGET_HAVE_LOB && optimize > 0))

Again, the test for TARGET_32BIT seems redundant.

+  if (single_set (next_use1)
+      && GET_CODE (SET_SRC (single_set (next_use1))) == ZERO_EXTEND)

Don't call single_set twice, just save the result of the first call and 
use that.  Perhaps
   rtx next_use1_set = single_set (next_use1);
   if (next_use1_set && ...)

+/* Attempt to transform the loop contents of loop basic block from VPT
+   predicated insns into unpredicated insns for a dlstp/letp loop.  */
+
+rtx
+arm_attempt_dlstp_transform (rtx label)

Please describe what is returned in the comment.  I'm guessing it's some 
form of iteration count, but why return 1 on failure?

+    return GEN_INT (1);

It's more efficient to write "return const1_rtx;".  In fact, it looks 
like this function always returns a CONST_INT and is only ever called 
from one place in thumb2.md, where we only ever look at the integer 
value.  So why not make it return a HOST_WIDE_INT in the first place?


+static bool
+arm_emit_mve_unpredicated_insn_to_seq (rtx_insn* insn)
+{
+
I think this function needs to also copy across the INSN_LOCATION 
information from the insn it is rewriting; that way we keep any 
diagnostic information and debug information more accurate.  Something like:

+    }
+  INSN_LOCATION (new_insn) = INSN_LOCATION (insn);
+  return true;

That will be enough if we never get more than one insn emitted from the 
emit_insn sequence above (emit_insn returns new_insn, which you'll need 
to save).  If you need something more complex, then perhaps you could use
   emit_insn_after_setloc (GEN_FCN..., get_last_insn (),
			  INSN_LOCATION (insn));

+  for (insn = seq; NEXT_INSN (insn); insn = NEXT_INSN (insn))
+    if (NOTE_P (insn))
+      emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
+    else if (DEBUG_INSN_P (insn))
+      emit_debug_insn_after (PATTERN (insn), BB_END (body));
+    else
+      emit_insn_after (PATTERN (insn), BB_END (body));
+
+

I'm not sure I follow why you can't replace this entire loop with

   emit_insn_after (seq, BB_END (body));

which should do all of the above for you.  But there's another problem here:

+    if (NOTE_P (insn))
+      emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));

Notes have data (see emit_note_copy in emit-rtl.cc).  If you can't use 
that function, you'll need to copy the note data manually in the same 
way as emit_note_copy does (or add emit_note_copy_after() as a new 
function in emit-rtl.cc).

I also note that you're already copying the note in 
arm_attempt_dlstp_transform and that also drops the note's data, but 
that really can use emit_note_copy().

thumb2.md:
+	  /* If we have a compatibe MVE target, try and analyse the loop

Typo: compatible

+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
@@ -0,0 +1,69 @@
+/* { dg-do run { target { arm*-*-* } } } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-require-effective-target arm_mve_hw } */
+/* { dg-options "-O2 -save-temps" } */
+/* { dg-add-options arm_v8_1m_mve } */
+

...

+/* { dg-final { scan-assembler-times {\tdlstp.16} 1 } } */

Please do not mix scan-assembler tests with execution tests.  The latter 
require specific hardware to run, while the former do not; mixing them 
means these tests get run by far fewer configurations of the compiler.
