[PATCH v4 01/10] Initial TI PRU GCC port

Sat Sep 22 10:23:00 GMT 2018

On Thursday, 9/13/2018 13:02:21 EEST Richard Sandiford wrote:
> Dimitar Dimitrov <dimitar@dinux.eu> writes:
> > +(define_insn
> > "sub_impl<EQD:mode><EQS0:mode><EQS1:mode>_<alu3_zext><alu3_zext_op1><alu3
> > _zext_op2>" +  [(set (match_operand:EQD 0 "register_operand" "=r,r,r")
> > +	(minus:EQD
> > +	 (zero_extend:EQD
> > +	  (match_operand:EQS0 1 "reg_or_ubyte_operand" "r,r,I"))
> > +	 (zero_extend:EQD
> > +	  (match_operand:EQS1 2 "reg_or_ubyte_operand" "r,I,r"))))]
> > +  ""
> > +  "@
> > +   sub\\t%0, %1, %2
> > +   sub\\t%0, %1, %2
> > +   rsb\\t%0, %2, %1"
> > +  [(set_attr "type" "alu")
> > +   (set_attr "length" "4")])
> 
> By convention, subtraction patterns shouldn't accept constants for
> operand 2.  Constants should instead be subtracted using an addition
> of the negative.
Understood. I will remove second alternative. But I will leave the third one 
since it enables an important optimization:

   unsigned test(unsigned a)
   {
        return 10-a;
   }

RTL:
(insn 6 3 7 2 (set (reg:SI 152)
        (minus:SI (const_int 10 [0xa])
            (reg/v:SI 151 [ a ]))) "test.c":430 -1
     (nil))

Assembly:
    test:
        rsb     r14, r14, 10
        ret

> 
> > +(define_constraint "I"
> > +  "An unsigned 8-bit constant."
> > +  (and (match_code "const_int")
> > +       (match_test "UBYTE_INT (ival)")))
> 
> As it stands this will reject QImode constants with the top bit set,
> since const_ints are always stored in sign-extended form.  E.g. QImode
> 128 is stored as (const_int -128) rather than (const_int 128).
> 
> Unfortunately this is difficult to fix in a clean way, since
> const_ints don't store their mode (a long-standing wart) and unlike
> predicates, constraints aren't given a mode from context.  The best
> way I can think of coping with it is:
> 
> a) have a separate constraint for -128...127
> b) add a define_mode_attr that maps QI to the new constraint and
>    HI and SI to I
> c) use <EQS1:...> etc. instead of I in the match_operands
> 
> Similar comment for "J" and HImode, although you already have the
> "N" as the corresponding signed constraint and so don't need a new one.
Thank you. This strategy worked for QImode. I will include the changes in my 
next patch.

Since PRU ALU operations do not have 16-bit constant operands, there is no 
need to add define_mode_attr for HImode. The "mov" pattern already handles the 
"N" [-32768, 32767] constraint.

> 
> 
> > +;; Return true if OP is a text segment reference.
> > +;; This is needed for program memory address expressions.  Borrowed from
> > AVR. +(define_predicate "text_segment_operand"
> > +  (match_code "code_label,label_ref,symbol_ref,plus,minus,const")
> > +{
> > +  switch (GET_CODE (op))
> > +    {
> > +    case CODE_LABEL:
> > +      return true;
> > +    case LABEL_REF :
> > +      return true;
> > +    case SYMBOL_REF :
> > +      return SYMBOL_REF_FUNCTION_P (op);
> > +    case PLUS :
> > +    case MINUS :
> > +      /* Assume canonical format of symbol + constant.
> > +	 Fall through.  */
> > +    case CONST :
> > +      return text_segment_operand (XEXP (op, 0), VOIDmode);
> > +    default :
> > +      return false;
> > +    }
> > +})
> 
> This probably comes from AVR, but: no spaces before ":".
> 
> Bit surprised that we can get a CODE_LABEL rather than a LABEL_REF here.
> Do you know if that triggers in practice, and if so where?
Indeed, CODE_LABEL case is never reached. I'll leave gcc_unreachable here.

> An IMO neater and slightly more robust way of writing the body is:
>   poly_int64 offset:
>   rtx base = strip_offset (op, &offset);
>   switch (GET_CODE (base))
>   
>     {
>     
>     case LABEL_REF:
>       ...as above...
>     
>     case SYMBOL_REF:
>       ...as above...
>     
>     default:
>       return false;
>     
>     }
> 
> with "plus" and "minus" not in the match_code list (since they should
> always appear in consts if they really are text references).

The "plus" and "minus" are needed when handling code labels as values. Take 
for example the following construct:

   int x = &&lab1 - &&lab0;
lab1:
  ...
lab2:

My TARGET_ASM_INTEGER callback uses the text_segment_operand predicate. In the 
above case it is passed the following RTL expression:
(minus:SI (label_ref/v:SI 20)
    (label_ref/v:SI 27))

I need to detect text labels so that I annotate them with %pmem:
        .4byte	%pmem(.L4-(.L2))
Instead of the incorrect:
        .4byte   .L3-(.L2)

> > +;; Return true if OP is a load multiple operation.  It is known to be a
> > +;; PARALLEL and the first section will be tested.
> > +
> > +(define_special_predicate "load_multiple_operation"
> > +  (match_code "parallel")
> > +{
> > +  machine_mode elt_mode;
> > +  int count = XVECLEN (op, 0);
> > +  unsigned int dest_regno;
> > +  rtx src_addr;
> > +  int i, off;
> > +
> > +  /* Perform a quick check so we don't blow up below.  */
> > +  if (GET_CODE (XVECEXP (op, 0, 0)) != SET
> > +      || GET_CODE (SET_DEST (XVECEXP (op, 0, 0))) != REG
> > +      || GET_CODE (SET_SRC (XVECEXP (op, 0, 0))) != MEM)
> > +    return false;
> > +
> > +  dest_regno = REGNO (SET_DEST (XVECEXP (op, 0, 0)));
> > +  src_addr = XEXP (SET_SRC (XVECEXP (op, 0, 0)), 0);
> > +  elt_mode = GET_MODE (SET_DEST (XVECEXP (op, 0, 0)));
> > +
> > +  /* Check, is base, or base + displacement.  */
> > +
> > +  if (GET_CODE (src_addr) == REG)
> > +    off = 0;
> > +  else if (GET_CODE (src_addr) == PLUS
> > +	   && GET_CODE (XEXP (src_addr, 0)) == REG
> > +	   && GET_CODE (XEXP (src_addr, 1)) == CONST_INT)
> > +    {
> > +      off = INTVAL (XEXP (src_addr, 1));
> > +      src_addr = XEXP (src_addr, 0);
> > +    }
> > +  else
> > +    return false;
> > +
> > +  for (i = 1; i < count; i++)
> > +    {
> > +      rtx elt = XVECEXP (op, 0, i);
> > +
> > +      if (GET_CODE (elt) != SET
> > +	  || GET_CODE (SET_DEST (elt)) != REG
> > +	  || GET_MODE (SET_DEST (elt)) != elt_mode
> > +	  || REGNO (SET_DEST (elt)) != dest_regno + i
> > +	  || GET_CODE (SET_SRC (elt)) != MEM
> > +	  || GET_MODE (SET_SRC (elt)) != elt_mode
> > +	  || GET_CODE (XEXP (SET_SRC (elt), 0)) != PLUS
> > +	  || ! rtx_equal_p (XEXP (XEXP (SET_SRC (elt), 0), 0), src_addr)
> > +	  || GET_CODE (XEXP (XEXP (SET_SRC (elt), 0), 1)) != CONST_INT
> > +	  || INTVAL (XEXP (XEXP (SET_SRC (elt), 0), 1))
> > +	     != off + i * GET_MODE_SIZE (elt_mode))
> > +	return false;
> > +    }
> > +
> > +  return true;
> > +})
> 
> This may not matter in practice, but the downside of testing PLUS and
> CONST_INT explicitly is that the predicate won't match something like:
> 
>   (set (reg R1) (mem (plus (reg RB) (const_int -4))))
>   (set (reg R2) (mem (reg RB)))
>   (set (reg R3) (mem (plus (reg RB) (const_int 4))))
>   ...
> 
> Using strip_offset on the addresses would avoid this.
It does not matter for PRU since all offsets are unsigned integers. But I will 
switch to strip_offset to make the code cleaner.

> > +/* Callback for walk_gimple_seq that checks TP tree for TI ABI
> > compliance.  */ +static tree
> > +check_op_callback (tree *tp, int *walk_subtrees, void *data)
> > +{
> > +  struct walk_stmt_info *wi = (struct walk_stmt_info *) data;
> > +
> > +  if (RECORD_OR_UNION_TYPE_P (*tp) || TREE_CODE (*tp) == ENUMERAL_TYPE)
> > +    {
> > +      /* Forward declarations have NULL tree type.  Skip them.  */
> > +      if (TREE_TYPE (*tp) == NULL)
> > +	return NULL;
> > +    }
> > +
> > +  /* TODO - why C++ leaves INTEGER_TYPE forward declarations around?  */
> > +  if (TREE_TYPE (*tp) == NULL)
> > +    return NULL;
> > +
> > +  const tree type = TREE_TYPE (*tp);
> > +
> > +  /* Direct function calls are allowed, obviously.  */
> > +  if (TREE_CODE (*tp) == ADDR_EXPR && TREE_CODE (type) == POINTER_TYPE)
> > +    {
> > +      const tree ptype = TREE_TYPE (type);
> > +      if (TREE_CODE (ptype) == FUNCTION_TYPE)
> > +	return NULL;
> > +    }
> 
> This seems like a bit of a dangerous heuristic.  Couldn't it also cause
> us to skip things like:
> 
>    (void *) func
> 
> ?  (OK, that's dubious C, but we do support it.)
The cast yields a "data pointer", which is 32 bits for both types of ABI. 
Hence it is safe to skip "(void *) func".

The TI ABI's 16 bit function pointers become a problem when they change the 
layout of structures and function argument registers.

> > +/* Implement TARGET_MODES_TIEABLE_P.  */
> > +
> > +static bool
> > +pru_modes_tieable_p (machine_mode mode1, machine_mode mode2)
> > +{
> > +  return (mode1 == mode2
> > +	  || (GET_MODE_SIZE (mode1) <= 4 && GET_MODE_SIZE (mode2) <= 4));
> > +}
> 
> I think this should have a comment explaining why this implementation
> is needed.
Sorry, this is stale. I will altogether remove the callback, and will use the 
default "true" implementation. PRU registers are fairly orthogonal.

A few years ago I got confused and considered this a requirement for correct 
64 bit operations.

> 
> > +	    {
> > +	      *total = 0;
> > +	    }
> > +	  else
> > +	    {
> > +	      /* SI move has the same cost as a QI move.  */
> > +	      int factor = GET_MODE_SIZE (mode) / GET_MODE_SIZE (SImode);
> > +	      if (factor == 0) 
> > +		factor = 1;
> > +	      *total = factor * COSTS_N_INSNS (1);
> > +	    }
> 
> Could you explain this a bit more?  It looks like a pure QImode operation
> gets a cost of 1 insn but an SImode operation zero-extended from QImode
> gets a cost of 0.
I have unintentionally bumped QI cost while trying to make zero-extends cheap. 
I will fix by using the following simple logic:

    case SET:
	  mode = GET_MODE (SET_DEST (x));
	  /* SI move has the same cost as a QI move.  Moves larger than
	     64 bits are costly.  */
	  int factor = CEIL (GET_MODE_SIZE (mode), GET_MODE_SIZE (SImode));
	  *total = factor * COSTS_N_INSNS (1);

> 
> [...]
> 
> > +/* Implement TARGET_PREFERRED_RELOAD_CLASS.  */
> > +static reg_class_t
> > +pru_preferred_reload_class (rtx x ATTRIBUTE_UNUSED, reg_class_t regclass)
> > +{
> > +  return regclass == NO_REGS ? GENERAL_REGS : regclass;
> > +}
> 
> This looks odd: PREFERRED_RELOAD_CLASS should return a subset of the
> original class rather than add new registers.  Can you remember why
> it was needed?
I'm not sure what target code is supposed to return when NO_REGS is passed 
here. I saw how other ports handle NO_REGS (c-sky, m32c, nios2, rl78). So I 
followed suite.

Here is a backtrace of LRA when NO_REGS is passed:
0xf629ae pru_preferred_reload_class
        ../../gcc/gcc/config/pru/pru.c:788
0xa3d6e8 process_alt_operands
        ../../gcc/gcc/lra-constraints.c:2600
0xa3ef7d curr_insn_transform
        ../../gcc/gcc/lra-constraints.c:3889
0xa4301e lra_constraints(bool)
        ../../gcc/gcc/lra-constraints.c:4906
0xa2726c lra(_IO_FILE*)
        ../../gcc/gcc/lra.c:2446
0x9c97a9 do_reload
        ../../gcc/gcc/ira.c:5469
0x9c97a9 execute
        ../../gcc/gcc/ira.c:5653

> 
> > +  /* Floating-point comparisons.  */
> > +  eqsf_libfunc = init_one_libfunc ("__pruabi_eqf");
> > +  nesf_libfunc = init_one_libfunc ("__pruabi_neqf");
> > +  lesf_libfunc = init_one_libfunc ("__pruabi_lef");
> > +  ltsf_libfunc = init_one_libfunc ("__pruabi_ltf");
> > +  gesf_libfunc = init_one_libfunc ("__pruabi_gef");
> > +  gtsf_libfunc = init_one_libfunc ("__pruabi_gtf");
> > +  eqdf_libfunc = init_one_libfunc ("__pruabi_eqd");
> > +  nedf_libfunc = init_one_libfunc ("__pruabi_neqd");
> > +  ledf_libfunc = init_one_libfunc ("__pruabi_led");
> > +  ltdf_libfunc = init_one_libfunc ("__pruabi_ltd");
> > +  gedf_libfunc = init_one_libfunc ("__pruabi_ged");
> > +  gtdf_libfunc = init_one_libfunc ("__pruabi_gtd");
> > +
> > +  set_optab_libfunc (eq_optab, SFmode, NULL);
> > +  set_optab_libfunc (ne_optab, SFmode, "__pruabi_neqf");
> > +  set_optab_libfunc (gt_optab, SFmode, NULL);
> > +  set_optab_libfunc (ge_optab, SFmode, NULL);
> > +  set_optab_libfunc (lt_optab, SFmode, NULL);
> > +  set_optab_libfunc (le_optab, SFmode, NULL);
> > +  set_optab_libfunc (unord_optab, SFmode, "__pruabi_unordf");
> > +  set_optab_libfunc (eq_optab, DFmode, NULL);
> > +  set_optab_libfunc (ne_optab, DFmode, "__pruabi_neqd");
> > +  set_optab_libfunc (gt_optab, DFmode, NULL);
> > +  set_optab_libfunc (ge_optab, DFmode, NULL);
> > +  set_optab_libfunc (lt_optab, DFmode, NULL);
> > +  set_optab_libfunc (le_optab, DFmode, NULL);
> 
> So you need to pass NULL for the ordered comparisons because the ABI
> routines return a different value from the one GCC expects?  Might be
> worth a comment if so.
Yes, I will comment.

> 
> > +/* Emit comparison instruction if necessary, returning the expression
> > +   that holds the compare result in the proper mode.  Return the
> > comparison +   that should be used in the jump insn.  */
> > +
> > +rtx
> > +pru_expand_fp_compare (rtx comparison, machine_mode mode)
> > +{
> > +  enum rtx_code code = GET_CODE (comparison);
> > +  rtx op0 = XEXP (comparison, 0);
> > +  rtx op1 = XEXP (comparison, 1);
> > +  rtx cmp;
> > +  enum rtx_code jump_code = code;
> > +  machine_mode op_mode = GET_MODE (op0);
> > +  rtx_insn *insns;
> > +  rtx libfunc;
> > +
> > +  gcc_assert (op_mode == DFmode || op_mode == SFmode);
> > +
> > +  if (code == UNGE)
> > +    {
> > +      code = LT;
> > +      jump_code = EQ;
> > +    }
> > +  else if (code == UNLE)
> > +    {
> > +      code = GT;
> > +      jump_code = EQ;
> > +    }
> 
> This is only safe if the LT and GT libcalls don't raise exceptions for
> unordered inputs.  I'm guessing they don't, but it's probably worth a
> comment if so.
Yes, I will comment.

> 
> > +/* Return true if register REGNO is a valid base register.
> > +   STRICT_P is true if REG_OK_STRICT is in effect.  */
> > +
> > +bool
> > +pru_regno_ok_for_base_p (int regno, bool strict_p)
> > +{
> > +  if (!HARD_REGISTER_NUM_P (regno))
> > +    {
> > +      if (!strict_p)
> > +	return true;
> > +
> > +      if (!reg_renumber)
> > +	return false;
> > +
> > +      regno = reg_renumber[regno];
> 
> reg_renumber is a reload-only thing and shouldn't be needed/used on
> LRA targets.
I will remove the snippet. It's a remnant from the PRU's reload days.

> 
> > +  if (REG_P (base)
> > +      && pru_regno_ok_for_base_p (REGNO (base), strict_p)
> > +      && (offset == NULL_RTX
> > +	  || (CONST_INT_P (offset)
> > +	      && pru_valid_const_ubyte_offset (mode, INTVAL (offset)))
> > +	  || (REG_P (offset)
> > +	      && pru_regno_ok_for_index_p (REGNO (offset), strict_p))))
> > +    {
> > +      /*     base register + register offset
> > +       * OR  base register + UBYTE constant offset.  */
> > +      return true;
> > +    }
> > +  else if (REG_P (base)
> > +	   && pru_regno_ok_for_index_p (REGNO (base), strict_p)
> > +	   && (offset != NULL_RTX && ctable_base_operand (offset, VOIDmode)))
> > +    {
> > +      /*     base CTABLE constant base + register offset
> > +       * Note: GCC always puts the register as a first operand of PLUS. 
> > */ +      return true;
> > +    }
> > +  else if (CONST_INT_P (base)
> > +	   && offset == NULL_RTX
> > +	   && (ctable_addr_operand (base, VOIDmode)))
> > +    {
> > +      /*     base CTABLE constant base + UBYTE constant offset.  */
> > +      return true;
> 
> This address should simply be a (const_int ...), with no containing (plus
> ...).
Yes, my bad. Will remove this snippet.

> > +    case SUBREG:
> > +    case MEM:
> > +      if (letter == 0)
> > +	{
> > +	  output_address (VOIDmode, op);
> > +	  return;
> > +	}
> 
> Does the SUBREG case ever hit?  They shouldn't appear as operands
> at this late stage.
No, it does not hit. I will put gcc_unreachable().

> > +/* Helper function to get the starting storage HW register for an
> > argument, +   or -1 if it must be passed on stack.  The cum_v state is
> > not changed.  */ +static int
> > +pru_function_arg_regi (cumulative_args_t cum_v,
> > +		       machine_mode mode, const_tree type,
> > +		       bool named)
> > +{
> > +  CUMULATIVE_ARGS *cum = get_cumulative_args (cum_v);
> > +  size_t argsize = pru_function_arg_size (mode, type);
> > +  size_t i, bi;
> > +  int regi = -1;
> > +
> > +  if (!pru_arg_in_reg_bysize (argsize))
> > +    return -1;
> > +
> > +  if (!named)
> > +    return -1;
> > +
> > +  /* Find the first available slot that fits.  Yes, that's the PRU ABI. 
> > */ +  for (i = 0; regi < 0 && i < ARRAY_SIZE (cum->regs_used); i++)
> > +    {
> > +      if (mode == BLKmode)
> > +	{
> > +	  /* Structs are passed, beginning at a full register.  */
> > +	  if ((i % 4) != 0)
> > +	    continue;
> > +	}
> 
> The comment doesn't seem to match the code: a struct like:
> 
>   struct s { short x; };
> 
> will have HImode rather than BLKmode.
I will update the comment to clarify it is only for large structs. Code should 
be ok.

> 
> It's usually better to base ABI stuff on the type where possible,
> since that corresponds directly to the source language, while modes
> are more of an internal GCC thing.
I will explore this option. But ABI is already defined in data sizes, which I 
think neatly fit with GCC's modes.

> 
> > +(define_insn "subdi3"
> > +  [(set (match_operand:DI 0 "register_operand"		    "=&r,&r,&r")
> > +	(minus:DI (match_operand:DI 1 "register_operand"    "r,r,I")
> > +		 (match_operand:DI 2 "reg_or_ubyte_operand" "r,I,r")))]
> > +  ""
> > +  "@
> > +   sub\\t%F0, %F1, %F2\;suc\\t%N0, %N1, %N2
> > +   sub\\t%F0, %F1, %2\;suc\\t%N0, %N1, 0
> > +   rsb\\t%F0, %F2, %1\;rsc\\t%N0, %N2, 0"
> > +  [(set_attr "type" "alu,alu,alu")
> > +   (set_attr "length" "8,8,8")])
> 
> As above, this shouldn't handle operand 2 being constant; that should
> always go through adddi3 instead.
I will remove alternative 2, but will leave 3.

> 
> > +; LRA cannot cope with clobbered op2, hence the scratch register.
> > +(define_insn "ashr<mode>3"
> > +  [(set (match_operand:QISI 0 "register_operand"	    "=&r,r")
> > +	  (ashiftrt:QISI
> > +	    (match_operand:QISI 1 "register_operand"	    "0,r")
> > +	    (match_operand:QISI 2 "reg_or_const_1_operand"  "r,P")))
> > +   (clobber (match_scratch:QISI 3			    "=r,X"))]
> > +  ""
> > +  "@
> > +   mov %3, %2\;ASHRLP%=:\;qbeq ASHREND%=, %3, 0\; sub %3, %3, 1\;
> > lsr\\t%0, %0, 1\; qbbc ASHRLP%=, %0, (%S0 * 8) - 2\; set %0, %0, (%S0 *
> > 8) - 1\; jmp ASHRLP%=\;ASHREND%=: +   lsr\\t%0, %1, 1\;qbbc LSIGN%=, %0,
> > (%S0 * 8) - 2\;set %0, %0, (%S0 * 8) - 1\;LSIGN%=:" +  [(set_attr "type"
> > "complex,alu")
> > +   (set_attr "length" "28,4")])
> 
> What do you mean by LRA not coping?  What did you try originally and
> what went wrong?
Better assembly could be generated if the shift count register was used for a 
loop counter. Pseudo code:

   while (op2--)
     op0 >>= 1;

I originally tried to clobber operand 2:
 (define_insn "ashr<mode>3"
   [(set (match_operand:QISI 0 "register_operand" "=&r,r")
         (ashiftrt:QISI
           (match_operand:QISI 1 "register_operand" "0,r")
           (match_operand:QISI 2 "reg_or_const_1_operand" "+r,P"))
          )]

But with the above pattern operand 2 was not clobbered. Its value was deemed 
untouched (i.e. live across the pattern). So I came up with  clobbering a 
separate register to fix this, at the expense of slightly bigger generated 
code.

> 
> > +(define_insn "<code>di3"
> > +  [(set (match_operand:DI 0 "register_operand"		"=&r,&r")
> > +	  (LOGICAL_BITOP:DI
> > +	    (match_operand:DI 1 "register_operand"	"%r,r")
> > +	    (match_operand:DI 2 "reg_or_ubyte_operand"	"r,I")))]
> > +  ""
> > +  "@
> > +   <logical_bitop_asm>\\t%F0, %F1, %F2\;<logical_bitop_asm>\\t%N0, %N1,
> > %N2 +   <logical_bitop_asm>\\t%F0, %F1, %2\;<logical_bitop_asm>\\t%N0,
> > %N1, 0" +  [(set_attr "type" "alu,alu")
> > +   (set_attr "length" "8,8")])
> > +
> > +
> > +(define_insn "one_cmpldi2"
> > +  [(set (match_operand:DI 0 "register_operand"		"=r")
> > +	(not:DI (match_operand:DI 1 "register_operand"	"r")))]
> > +  ""
> > +{
> > +  /* careful with overlapping source and destination regs.  */
> > +  gcc_assert (GP_REG_P (REGNO (operands[0])));
> > +  gcc_assert (GP_REG_P (REGNO (operands[1])));
> > +  if (REGNO (operands[0]) == (REGNO (operands[1]) + 4))
> > +    return "not\\t%N0, %N1\;not\\t%F0, %F1";
> > +  else
> > +    return "not\\t%F0, %F1\;not\\t%N0, %N1";
> > +}
> > +  [(set_attr "type" "alu")
> > +   (set_attr "length" "8")])
> 
> Do you see any code improvements from defining these patterns?
> These days I'd expect you'd get better code by letting the
> target-independent code split them up into SImode operations.
Yes, these patterns improve code size and speed.

Looking at expand_binop(), DI logical ops are split into WORD mode, which in 
PRU's peculiar case is QI. So without those patterns, a 64 bit IOR generates 8 
QI instructions:

        or      r0.b0, r14.b0, r16.b0
        or      r0.b1, r14.b1, r16.b1
        or      r0.b2, r14.b2, r16.b2
        or      r0.b3, r14.b3, r16.b3
        or      r1.b0, r15.b0, r17.b0
        or      r1.b1, r15.b1, r17.b1
        or      r1.b2, r15.b2, r17.b2
        or      r1.b3, r15.b3, r17.b3

Whereas with patterns defined, it is only 2 SImode instructions:
        or      r0, r14, r16
        or      r1, r15, r17

> 
> > +;; Multiply instruction.  Idea for fixing registers comes from the AVR
> > backend. +
> > +(define_expand "mulsi3"
> > +  [(set (match_operand:SI 0 "register_operand" "")
> > +	(mult:SI (match_operand:SI 1 "register_operand" "")
> > +		 (match_operand:SI 2 "register_operand" "")))]
> > +  ""
> > +{
> > +  emit_insn (gen_mulsi3_fixinp (operands[0], operands[1], operands[2]));
> > +  DONE;
> > +})
> > +
> > +
> > +(define_expand "mulsi3_fixinp"
> > +  [(set (reg:SI 112) (match_operand:SI 1 "register_operand" ""))
> > +   (set (reg:SI 116) (match_operand:SI 2 "register_operand" ""))
> > +   (set (reg:SI 104) (mult:SI (reg:SI 112) (reg:SI 116)))
> > +   (set (match_operand:SI 0 "register_operand" "") (reg:SI 104))]
> > +  ""
> > +{
> > +})
> 
> This seems slightly dangerous since there's nothing to guarantee that
> those registers aren't already live at the point of expansion.
> 
> The more usual way (and IMO correct way) would be for:
> > +(define_insn "*mulsi3_prumac"
> > +  [(set (reg:SI 104) (mult:SI (reg:SI 112) (reg:SI 116)))]
> > +  ""
> > +  "nop\;xin\\t0, r26, 4"
> > +  [(set_attr "type" "alu")
> > +   (set_attr "length" "8")])
> 
> to have register classes and constraints for these three registers,
> like e.g. the x86 port does for %cx etc.
> 
> It would be good if you could try that.  On the other hand, if AVR
> already does this and it worked in practice then I guess it's not
> something that should hold up the port.
This suggestion worked. It was a matter of correct predicates.

Here is what I will put in the next patch version:

(define_predicate "pru_muldst_operand"
  (and (match_code "reg")
       (ior (match_test "REGNO_REG_CLASS (REGNO (op)) == MULDST_REGS")
	    (match_test "REGNO (op) >= FIRST_PSEUDO_REGISTER"))))
...
(define_register_constraint "Rmd0" "MULDST_REGS"
  "@internal
  The multiply destination register.")
...

(define_insn "mulsi3"
  [(set (match_operand:SI 0 "pru_muldst_operand"	   "=Rmd0")
	(mult:SI (match_operand:SI 1 "pru_mulsrc0_operand" "Rms0")
		 (match_operand:SI 2 "pru_mulsrc1_operand" "Rms1")))]

> 
> > @@ -23444,6 +23449,56 @@ the offset with a symbol reference to a canary in
> > the TLS block.> 
> >  @end table
> > 
> > +@node PRU Options
> > +@subsection PRU Options
> > +@cindex PRU Options
> > +
> > +These command-line options are defined for PRU target:
> > +
> > +@table @gcctabopt
> > +@item -minrt
> > +@opindex minrt
> > +Enable the use of a minimum runtime environment---no static
> > +initializers or constructors.  Results in significant code size
> > +reduction of final ELF binaries.
> 
> Up to you, but this read to me as "-m"+"inrt".  If this option is already
> in use then obviously keep it, but if it's new then it might be worth
> calling it -mminrt instead, for consistency with -mmcu.
I assumed this is a standard naming since MSP430 already uses it. I've seen 
the option being utilized by pru-gcc users, so let's keep it that way.

> > +@item T
> > +A text segment (program memory) constant label.
> > +
> > +@item Z
> > +Integer constant zero.
> 
> Just checking, but did you deliberately leave out N?  That's fine if so,
> but the docstring in constraints.md didn't have an "@internal" marker to
> indicate it was internal-only.
I did not know about "@internal. I will annotate appropriately.

> 
> Despite the long reply (sorry), this looks in really good shape to me FWIW.
> Thanks for the submission.
I take this sentence as an indication that the PRU port is welcome for 
inclusion into GCC. I'll work on fixing the comments and will send a new patch 
version.

Thank you for the in-depth review and valuable suggestions. 

Regards,
Dimitar