[PATCH v4 01/10] Initial TI PRU GCC port

Mon Sep 24 10:49:00 GMT 2018

Dimitar Dimitrov <dimitar@dinux.eu> writes:
> On Thursday, 9/13/2018 13:02:21 EEST Richard Sandiford wrote:
>> Dimitar Dimitrov <dimitar@dinux.eu> writes:
>> > +(define_insn
>> > "sub_impl<EQD:mode><EQS0:mode><EQS1:mode>_<alu3_zext><alu3_zext_op1><alu3
>> > _zext_op2>" +  [(set (match_operand:EQD 0 "register_operand" "=r,r,r")
>> > +	(minus:EQD
>> > +	 (zero_extend:EQD
>> > +	  (match_operand:EQS0 1 "reg_or_ubyte_operand" "r,r,I"))
>> > +	 (zero_extend:EQD
>> > +	  (match_operand:EQS1 2 "reg_or_ubyte_operand" "r,I,r"))))]
>> > +  ""
>> > +  "@
>> > +   sub\\t%0, %1, %2
>> > +   sub\\t%0, %1, %2
>> > +   rsb\\t%0, %2, %1"
>> > +  [(set_attr "type" "alu")
>> > +   (set_attr "length" "4")])
>> 
>> By convention, subtraction patterns shouldn't accept constants for
>> operand 2.  Constants should instead be subtracted using an addition
>> of the negative.
> Understood. I will remove second alternative. But I will leave the third one 
> since it enables an important optimization:
>
>    unsigned test(unsigned a)
>    {
>         return 10-a;
>    }
>
> RTL:
> (insn 6 3 7 2 (set (reg:SI 152)
>         (minus:SI (const_int 10 [0xa])
>             (reg/v:SI 151 [ a ]))) "test.c":430 -1
>      (nil))
>
> Assembly:
>     test:
>         rsb     r14, r14, 10
>         ret

Thanks.  And yeah, the final alternative is fine.

>> > +;; Return true if OP is a text segment reference.
>> > +;; This is needed for program memory address expressions.  Borrowed from
>> > AVR. +(define_predicate "text_segment_operand"
>> > +  (match_code "code_label,label_ref,symbol_ref,plus,minus,const")
>> > +{
>> > +  switch (GET_CODE (op))
>> > +    {
>> > +    case CODE_LABEL:
>> > +      return true;
>> > +    case LABEL_REF :
>> > +      return true;
>> > +    case SYMBOL_REF :
>> > +      return SYMBOL_REF_FUNCTION_P (op);
>> > +    case PLUS :
>> > +    case MINUS :
>> > +      /* Assume canonical format of symbol + constant.
>> > +	 Fall through.  */
>> > +    case CONST :
>> > +      return text_segment_operand (XEXP (op, 0), VOIDmode);
>> > +    default :
>> > +      return false;
>> > +    }
>> > +})
>> 
>> This probably comes from AVR, but: no spaces before ":".
>> 
>> Bit surprised that we can get a CODE_LABEL rather than a LABEL_REF here.
>> Do you know if that triggers in practice, and if so where?
> Indeed, CODE_LABEL case is never reached. I'll leave gcc_unreachable here.
>
>> An IMO neater and slightly more robust way of writing the body is:
>>   poly_int64 offset:
>>   rtx base = strip_offset (op, &offset);
>>   switch (GET_CODE (base))
>>   
>>     {
>>     
>>     case LABEL_REF:
>>       ...as above...
>>     
>>     case SYMBOL_REF:
>>       ...as above...
>>     
>>     default:
>>       return false;
>>     
>>     }
>> 
>> with "plus" and "minus" not in the match_code list (since they should
>> always appear in consts if they really are text references).
>
> The "plus" and "minus" are needed when handling code labels as values. Take 
> for example the following construct:
>
>    int x = &&lab1 - &&lab0;
> lab1:
>   ...
> lab2:
>
>
> My TARGET_ASM_INTEGER callback uses the text_segment_operand predicate. In the 
> above case it is passed the following RTL expression:
> (minus:SI (label_ref/v:SI 20)
>     (label_ref/v:SI 27))
>
> I need to detect text labels so that I annotate them with %pmem:
>         .4byte	%pmem(.L4-(.L2))
> Instead of the incorrect:
>         .4byte   .L3-(.L2)

OK, thanks for the explanation.  I think the target-independent code should
really be passing (const (minus ...)) rather than a plain (minus ...) here,
but that's going to be difficult to change at this stage.  And like you say,
the split_offset suggestion wouldn't have handled this anyway.

So yeah, please keep what you have now.

>> > +/* Callback for walk_gimple_seq that checks TP tree for TI ABI
>> > compliance.  */ +static tree
>> > +check_op_callback (tree *tp, int *walk_subtrees, void *data)
>> > +{
>> > +  struct walk_stmt_info *wi = (struct walk_stmt_info *) data;
>> > +
>> > +  if (RECORD_OR_UNION_TYPE_P (*tp) || TREE_CODE (*tp) == ENUMERAL_TYPE)
>> > +    {
>> > +      /* Forward declarations have NULL tree type.  Skip them.  */
>> > +      if (TREE_TYPE (*tp) == NULL)
>> > +	return NULL;
>> > +    }
>> > +
>> > +  /* TODO - why C++ leaves INTEGER_TYPE forward declarations around?  */
>> > +  if (TREE_TYPE (*tp) == NULL)
>> > +    return NULL;
>> > +
>> > +  const tree type = TREE_TYPE (*tp);
>> > +
>> > +  /* Direct function calls are allowed, obviously.  */
>> > +  if (TREE_CODE (*tp) == ADDR_EXPR && TREE_CODE (type) == POINTER_TYPE)
>> > +    {
>> > +      const tree ptype = TREE_TYPE (type);
>> > +      if (TREE_CODE (ptype) == FUNCTION_TYPE)
>> > +	return NULL;
>> > +    }
>> 
>> This seems like a bit of a dangerous heuristic.  Couldn't it also cause
>> us to skip things like:
>> 
>>    (void *) func
>> 
>> ?  (OK, that's dubious C, but we do support it.)
> The cast yields a "data pointer", which is 32 bits for both types of ABI. 
> Hence it is safe to skip "(void *) func".
>
> The TI ABI's 16 bit function pointers become a problem when they change the 
> layout of structures and function argument registers.

OK.  The reason this stood out is that the code doesn't obviously match
the comment.  If the code is just trying to skip direct function calls,
I think the gcall sequence I mentioned would be more obvious, and would
match the existing comment.  If anything that takes the address of a
function is OK then it might be worth changing the comment to include that.

>> > +	    {
>> > +	      *total = 0;
>> > +	    }
>> > +	  else
>> > +	    {
>> > +	      /* SI move has the same cost as a QI move.  */
>> > +	      int factor = GET_MODE_SIZE (mode) / GET_MODE_SIZE (SImode);
>> > +	      if (factor == 0) 
>> > +		factor = 1;
>> > +	      *total = factor * COSTS_N_INSNS (1);
>> > +	    }
>> 
>> Could you explain this a bit more?  It looks like a pure QImode operation
>> gets a cost of 1 insn but an SImode operation zero-extended from QImode
>> gets a cost of 0.
> I have unintentionally bumped QI cost while trying to make zero-extends cheap. 
> I will fix by using the following simple logic:
>
>     case SET:
> 	  mode = GET_MODE (SET_DEST (x));
> 	  /* SI move has the same cost as a QI move.  Moves larger than
> 	     64 bits are costly.  */
> 	  int factor = CEIL (GET_MODE_SIZE (mode), GET_MODE_SIZE (SImode));
> 	  *total = factor * COSTS_N_INSNS (1);

Looks good.

>> > +/* Implement TARGET_PREFERRED_RELOAD_CLASS.  */
>> > +static reg_class_t
>> > +pru_preferred_reload_class (rtx x ATTRIBUTE_UNUSED, reg_class_t regclass)
>> > +{
>> > +  return regclass == NO_REGS ? GENERAL_REGS : regclass;
>> > +}
>> 
>> This looks odd: PREFERRED_RELOAD_CLASS should return a subset of the
>> original class rather than add new registers.  Can you remember why
>> it was needed?
> I'm not sure what target code is supposed to return when NO_REGS is passed 
> here. I saw how other ports handle NO_REGS (c-sky, m32c, nios2, rl78). So I 
> followed suite.
>
> Here is a backtrace of LRA when NO_REGS is passed:
> 0xf629ae pru_preferred_reload_class
>         ../../gcc/gcc/config/pru/pru.c:788
> 0xa3d6e8 process_alt_operands
>         ../../gcc/gcc/lra-constraints.c:2600
> 0xa3ef7d curr_insn_transform
>         ../../gcc/gcc/lra-constraints.c:3889
> 0xa4301e lra_constraints(bool)
>         ../../gcc/gcc/lra-constraints.c:4906
> 0xa2726c lra(_IO_FILE*)
>         ../../gcc/gcc/lra.c:2446
> 0x9c97a9 do_reload
>         ../../gcc/gcc/ira.c:5469
> 0x9c97a9 execute
>         ../../gcc/gcc/ira.c:5653

I think it should just pass NO_REGS through unmodified (which means
spill to memory).  In some ways it would be good if it didn't have to
handle this case, but again that's going to be work to fix.

The RA will pass ALL_REGS if it can handle any register, and wants to
know what the target would prefer.  But for any input, the hook needs
to stick within the registers it was given.

>> > +/* Helper function to get the starting storage HW register for an
>> > argument, +   or -1 if it must be passed on stack.  The cum_v state is
>> > not changed.  */ +static int
>> > +pru_function_arg_regi (cumulative_args_t cum_v,
>> > +		       machine_mode mode, const_tree type,
>> > +		       bool named)
>> > +{
>> > +  CUMULATIVE_ARGS *cum = get_cumulative_args (cum_v);
>> > +  size_t argsize = pru_function_arg_size (mode, type);
>> > +  size_t i, bi;
>> > +  int regi = -1;
>> > +
>> > +  if (!pru_arg_in_reg_bysize (argsize))
>> > +    return -1;
>> > +
>> > +  if (!named)
>> > +    return -1;
>> > +
>> > +  /* Find the first available slot that fits.  Yes, that's the PRU ABI. 
>> > */ +  for (i = 0; regi < 0 && i < ARRAY_SIZE (cum->regs_used); i++)
>> > +    {
>> > +      if (mode == BLKmode)
>> > +	{
>> > +	  /* Structs are passed, beginning at a full register.  */
>> > +	  if ((i % 4) != 0)
>> > +	    continue;
>> > +	}
>> 
>> The comment doesn't seem to match the code: a struct like:
>> 
>>   struct s { short x; };
>> 
>> will have HImode rather than BLKmode.
> I will update the comment to clarify it is only for large structs. Code should 
> be ok.
>
>> It's usually better to base ABI stuff on the type where possible,
>> since that corresponds directly to the source language, while modes
>> are more of an internal GCC thing.
> I will explore this option. But ABI is already defined in data sizes, which I 
> think neatly fit with GCC's modes.

Thanks.  And this certainly shouldn't hold up acceptance, it was just a
suggestion.

>> > +; LRA cannot cope with clobbered op2, hence the scratch register.
>> > +(define_insn "ashr<mode>3"
>> > +  [(set (match_operand:QISI 0 "register_operand"	    "=&r,r")
>> > +	  (ashiftrt:QISI
>> > +	    (match_operand:QISI 1 "register_operand"	    "0,r")
>> > +	    (match_operand:QISI 2 "reg_or_const_1_operand"  "r,P")))
>> > +   (clobber (match_scratch:QISI 3			    "=r,X"))]
>> > +  ""
>> > +  "@
>> > +   mov %3, %2\;ASHRLP%=:\;qbeq ASHREND%=, %3, 0\; sub %3, %3, 1\;
>> > lsr\\t%0, %0, 1\; qbbc ASHRLP%=, %0, (%S0 * 8) - 2\; set %0, %0, (%S0 *
>> > 8) - 1\; jmp ASHRLP%=\;ASHREND%=: +   lsr\\t%0, %1, 1\;qbbc LSIGN%=, %0,
>> > (%S0 * 8) - 2\;set %0, %0, (%S0 * 8) - 1\;LSIGN%=:" +  [(set_attr "type"
>> > "complex,alu")
>> > +   (set_attr "length" "28,4")])
>> 
>> What do you mean by LRA not coping?  What did you try originally and
>> what went wrong?
> Better assembly could be generated if the shift count register was used for a 
> loop counter. Pseudo code:
>
>    while (op2--)
>      op0 >>= 1;
>  
> I originally tried to clobber operand 2:
>  (define_insn "ashr<mode>3"
>    [(set (match_operand:QISI 0 "register_operand" "=&r,r")
>          (ashiftrt:QISI
>            (match_operand:QISI 1 "register_operand" "0,r")
>            (match_operand:QISI 2 "reg_or_const_1_operand" "+r,P"))
>           )]
>
> But with the above pattern operand 2 was not clobbered. Its value was deemed 
> untouched (i.e. live across the pattern). So I came up with  clobbering a 
> separate register to fix this, at the expense of slightly bigger generated 
> code.

Ah, yeah, "+" has to be on an output operand (the destination of a set
or clobber).

In this kind of situation you might be better off expanding the loop
as individual RTL instructions, which is what several targets do for
things like memmove.  But if you'd prefer to keep a single pattern,
you could use:

  [(set (match_operand:QISI 0 "register_operand" "=r")
	(ashiftrt:QISI
	  (match_operand:QISI 2 "register_operand" "0")
	  (match_operand:QISI 3 "register_operand" "1")))
   (set (match_operand:QISI 1 "register_operand" "=r")
	(const_int -1))]

(assuming the optimised loop does leave operand 2 containing all-1s).

The shift by 1 would then need to be a separate pattern, with the
define_expand choosing between them.

>> > +(define_insn "<code>di3"
>> > +  [(set (match_operand:DI 0 "register_operand"		"=&r,&r")
>> > +	  (LOGICAL_BITOP:DI
>> > +	    (match_operand:DI 1 "register_operand"	"%r,r")
>> > +	    (match_operand:DI 2 "reg_or_ubyte_operand"	"r,I")))]
>> > +  ""
>> > +  "@
>> > +   <logical_bitop_asm>\\t%F0, %F1, %F2\;<logical_bitop_asm>\\t%N0, %N1,
>> > %N2 +   <logical_bitop_asm>\\t%F0, %F1, %2\;<logical_bitop_asm>\\t%N0,
>> > %N1, 0" +  [(set_attr "type" "alu,alu")
>> > +   (set_attr "length" "8,8")])
>> > +
>> > +
>> > +(define_insn "one_cmpldi2"
>> > +  [(set (match_operand:DI 0 "register_operand"		"=r")
>> > +	(not:DI (match_operand:DI 1 "register_operand"	"r")))]
>> > +  ""
>> > +{
>> > +  /* careful with overlapping source and destination regs.  */
>> > +  gcc_assert (GP_REG_P (REGNO (operands[0])));
>> > +  gcc_assert (GP_REG_P (REGNO (operands[1])));
>> > +  if (REGNO (operands[0]) == (REGNO (operands[1]) + 4))
>> > +    return "not\\t%N0, %N1\;not\\t%F0, %F1";
>> > +  else
>> > +    return "not\\t%F0, %F1\;not\\t%N0, %N1";
>> > +}
>> > +  [(set_attr "type" "alu")
>> > +   (set_attr "length" "8")])
>> 
>> Do you see any code improvements from defining these patterns?
>> These days I'd expect you'd get better code by letting the
>> target-independent code split them up into SImode operations.
> Yes, these patterns improve code size and speed.
>
> Looking at expand_binop(), DI logical ops are split into WORD mode, which in 
> PRU's peculiar case is QI. So without those patterns, a 64 bit IOR generates 8 
> QI instructions:
>
>         or      r0.b0, r14.b0, r16.b0
>         or      r0.b1, r14.b1, r16.b1
>         or      r0.b2, r14.b2, r16.b2
>         or      r0.b3, r14.b3, r16.b3
>         or      r1.b0, r15.b0, r17.b0
>         or      r1.b1, r15.b1, r17.b1
>         or      r1.b2, r15.b2, r17.b2
>         or      r1.b3, r15.b3, r17.b3
>
> Whereas with patterns defined, it is only 2 SImode instructions:
>         or      r0, r14, r16
>         or      r1, r15, r17

Ah, yeah, I'd forgotten about words being QImode.  Ideally the doubleword
handling would be extended to any mode that is twice the width of a "cheap"
mode, but again that'd be a fair amount of work.

So yeah, please keep this as-is.

>> > +;; Multiply instruction.  Idea for fixing registers comes from the AVR
>> > backend. +
>> > +(define_expand "mulsi3"
>> > +  [(set (match_operand:SI 0 "register_operand" "")
>> > +	(mult:SI (match_operand:SI 1 "register_operand" "")
>> > +		 (match_operand:SI 2 "register_operand" "")))]
>> > +  ""
>> > +{
>> > +  emit_insn (gen_mulsi3_fixinp (operands[0], operands[1], operands[2]));
>> > +  DONE;
>> > +})
>> > +
>> > +
>> > +(define_expand "mulsi3_fixinp"
>> > +  [(set (reg:SI 112) (match_operand:SI 1 "register_operand" ""))
>> > +   (set (reg:SI 116) (match_operand:SI 2 "register_operand" ""))
>> > +   (set (reg:SI 104) (mult:SI (reg:SI 112) (reg:SI 116)))
>> > +   (set (match_operand:SI 0 "register_operand" "") (reg:SI 104))]
>> > +  ""
>> > +{
>> > +})
>> 
>> This seems slightly dangerous since there's nothing to guarantee that
>> those registers aren't already live at the point of expansion.
>> 
>> The more usual way (and IMO correct way) would be for:
>> > +(define_insn "*mulsi3_prumac"
>> > +  [(set (reg:SI 104) (mult:SI (reg:SI 112) (reg:SI 116)))]
>> > +  ""
>> > +  "nop\;xin\\t0, r26, 4"
>> > +  [(set_attr "type" "alu")
>> > +   (set_attr "length" "8")])
>> 
>> to have register classes and constraints for these three registers,
>> like e.g. the x86 port does for %cx etc.
>> 
>> It would be good if you could try that.  On the other hand, if AVR
>> already does this and it worked in practice then I guess it's not
>> something that should hold up the port.
> This suggestion worked. It was a matter of correct predicates.
>
> Here is what I will put in the next patch version:
>
> (define_predicate "pru_muldst_operand"
>   (and (match_code "reg")
>        (ior (match_test "REGNO_REG_CLASS (REGNO (op)) == MULDST_REGS")
> 	    (match_test "REGNO (op) >= FIRST_PSEUDO_REGISTER"))))
> ...
> (define_register_constraint "Rmd0" "MULDST_REGS"
>   "@internal
>   The multiply destination register.")
> ...
>
> (define_insn "mulsi3"
>   [(set (match_operand:SI 0 "pru_muldst_operand"	   "=Rmd0")
> 	(mult:SI (match_operand:SI 1 "pru_mulsrc0_operand" "Rms0")
> 		 (match_operand:SI 2 "pru_mulsrc1_operand" "Rms1")))]

Looks good.  Thanks for doing this.

You could also add a "%" constraint to operand 1 to allow the input
registers to be swapped, just on the off chance that that's useful.
But I suppose there's a risk that providing that alternative will
confuse the RA cost model a bit.  Just a suggestion.

>> > @@ -23444,6 +23449,56 @@ the offset with a symbol reference to a canary in
>> > the TLS block.> 
>> >  @end table
>> > 
>> > +@node PRU Options
>> > +@subsection PRU Options
>> > +@cindex PRU Options
>> > +
>> > +These command-line options are defined for PRU target:
>> > +
>> > +@table @gcctabopt
>> > +@item -minrt
>> > +@opindex minrt
>> > +Enable the use of a minimum runtime environment---no static
>> > +initializers or constructors.  Results in significant code size
>> > +reduction of final ELF binaries.
>> 
>> Up to you, but this read to me as "-m"+"inrt".  If this option is already
>> in use then obviously keep it, but if it's new then it might be worth
>> calling it -mminrt instead, for consistency with -mmcu.
> I assumed this is a standard naming since MSP430 already uses it. I've seen 
> the option being utilized by pru-gcc users, so let's keep it that way.

OK.

>> Despite the long reply (sorry), this looks in really good shape to me FWIW.
>> Thanks for the submission.
> I take this sentence as an indication that the PRU port is welcome for 
> inclusion into GCC. I'll work on fixing the comments and will send a new patch 
> version.

Well, process-wise, it's up to the steering committee to decide whether
to accept the port.  But one of the main requirements is of course
reviewing the patches.

Which of the other patches in the series still need review?  I think Jeff
approved some of them earlier.

Thanks,
Richard