[PATCH 2/2] AArch64: Add better costing for vector constants and operations

Richard Sandiford <richard.sandiford@arm.com>
Tue Aug 31 18:37:48 GMT 2021


Tamar Christina <Tamar.Christina@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Tuesday, August 31, 2021 5:07 PM
>> To: Tamar Christina <Tamar.Christina@arm.com>
>> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
>> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
>> <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
>> Subject: Re: [PATCH 2/2]AArch64: Add better costing for vector constants
>> and operations
>> 
>> Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> -----Original Message-----
>> >> From: Richard Sandiford <richard.sandiford@arm.com>
>> >> Sent: Tuesday, August 31, 2021 4:14 PM
>> >> To: Tamar Christina <Tamar.Christina@arm.com>
>> >> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
>> >> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
>> >> <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
>> <Kyrylo.Tkachov@arm.com>
>> >> Subject: Re: [PATCH 2/2]AArch64: Add better costing for vector
>> >> constants and operations
>> >>
>> >> Tamar Christina <tamar.christina@arm.com> writes:
>> >> > @@ -13936,8 +13937,65 @@ cost_plus:
>> >> >  			     mode, MULT, 1, speed);
>> >> >            return true;
>> >> >          }
>> >> > +	break;
>> >> > +    case PARALLEL:
>> >> > +      /* Fall through */
>> >>
>> >> Which code paths lead to getting a PARALLEL here?
>> >
>> > Hi,
>> >
>> > Thanks for the review!
>> >
>> > I added it for completeness because CSE treats a parallel and
>> > CONST_VECTOR as equivalent when each entry in the parallel defines
>> > a constant.
>> 
>> Could you test whether it ever triggers in practice though?
>> The code would be much simpler without it.
>
> Will check 😊
>
>> 
>> >> > +    case CONST_VECTOR:
>> >> > +	{
>> >> > +	  rtx gen_insn = aarch64_simd_make_constant (x, true);
>> >> > +	  /* Not a valid const vector.  */
>> >> > +	  if (!gen_insn)
>> >> > +	    break;
>> >> >
>> >> > -      /* Fall through.  */
>> >> > +	  switch (GET_CODE (gen_insn))
>> >> > +	  {
>> >> > +	  case CONST_VECTOR:
>> >> > +	    /* Load using MOVI/MVNI.  */
>> >> > +	    if (aarch64_simd_valid_immediate (x, NULL))
>> >> > +	      *cost += extra_cost->vect.movi;
>> >> > +	    else /* Load using constant pool.  */
>> >> > +	      *cost += extra_cost->ldst.load;
>> >> > +	    break;
>> >> > +	  /* Load using a DUP.  */
>> >> > +	  case VEC_DUPLICATE:
>> >> > +	    *cost += extra_cost->vect.dup;
>> >> > +	    break;
>> >>
>> >> Does this trigger in practice?  The new check==true path (rightly)
>> >> stops the duplicated element from being forced into a register, but
>> >> then I would have
>> >> expected:
>> >>
>> >> rtx
>> >> gen_vec_duplicate (machine_mode mode, rtx x) {
>> >>   if (valid_for_const_vector_p (mode, x))
>> >>     return gen_const_vec_duplicate (mode, x);
>> >>   return gen_rtx_VEC_DUPLICATE (mode, x); }
>> >>
>> >> to generate the original CONST_VECTOR again.
>> >
>> > Yes, but CSE is trying to see whether using a DUP is cheaper than
>> > another instruction.  Normal code won't hit this, but CSE is just
>> > costing all the different ways one can semantically construct a vector;
>> > which RTL actually comes out depends on how it's folded, as you say.
>> 
>> But what I mean is, you call:
>> 
>> 	  rtx gen_insn = aarch64_simd_make_constant (x, true);
>> 	  /* Not a valid const vector.  */
>> 	  if (!gen_insn)
>> 	    break;
>> 
>> where aarch64_simd_make_constant does:
>> 
>>   if (CONST_VECTOR_P (vals))
>>     const_vec = vals;
>>   else if (GET_CODE (vals) == PARALLEL)
>>     {
>>       /* A CONST_VECTOR must contain only CONST_INTs and
>> 	 CONST_DOUBLEs, but CONSTANT_P allows more (e.g. SYMBOL_REF).
>> 	 Only store valid constants in a CONST_VECTOR.  */
>>       int n_elts = XVECLEN (vals, 0);
>>       for (i = 0; i < n_elts; ++i)
>> 	{
>> 	  rtx x = XVECEXP (vals, 0, i);
>> 	  if (CONST_INT_P (x) || CONST_DOUBLE_P (x))
>> 	    n_const++;
>> 	}
>>       if (n_const == n_elts)
>> 	const_vec = gen_rtx_CONST_VECTOR (mode, XVEC (vals, 0));
>>     }
>>   else
>>     gcc_unreachable ();
>> 
>>   if (const_vec != NULL_RTX
>>       && aarch64_simd_valid_immediate (const_vec, NULL))
>>     /* Load using MOVI/MVNI.  */
>>     return const_vec;
>>   else if ((const_dup = aarch64_simd_dup_constant (vals, check)) !=
>> NULL_RTX)
>>     /* Loaded using DUP.  */
>>     return const_dup;
>> 
>> and aarch64_simd_dup_constant does:
>> 
>>   machine_mode mode = GET_MODE (vals);
>>   machine_mode inner_mode = GET_MODE_INNER (mode);
>>   rtx x;
>> 
>>   if (!const_vec_duplicate_p (vals, &x))
>>     return NULL_RTX;
>> 
>>   /* We can load this constant by using DUP and a constant in a
>>      single ARM register.  This will be cheaper than a vector
>>      load.  */
>>   if (!check)
>>     x = copy_to_mode_reg (inner_mode, x);
>>   return gen_vec_duplicate (mode, x);
>> 
>> For the “check” case, “x” will be a constant, and so gen_vec_duplicate will call
>> gen_const_vec_duplicate, which will return a CONST_VECTOR.
>> It didn't seem to be possible for gen_insn to be a VEC_DUPLICATE.
>>
>
> Yes, but CSE can ask the cost of a VEC_DUPLICATE directly on a register without going through gen_const_vec_duplicate.
> That is intentional, as the gen_ functions can have side effects (e.g. creating new pseudos).
>
> If, say, it sees a constant x and a vector [x x x x], it wants to know the cost of keeping
> x and materializing [x x x x] separately vs. duplicating x into [x x x x].
>
> In this case, since both the constant and the vector are needed, you won't get a constant there but a register, so you'll actually see a
> vec_dup.  If CSE pushed the constant in, that would defeat the point 😊.  Right now it's CSE that's folding vec_dups of constants into vec_constants.
>
> My change makes it explicitly ask for the cost of doing this instead of assuming it is always cheaper, because in a large majority of
> cases it's not actually cheaper, and the answer is highly dependent on the target's ability to create said constant.
>
> So this hook will see both versions, the dup of the register and the vec_constant while CSE is trying to decide which one to keep.

But the code I quoted above is from:

+	break;
+    case PARALLEL:
+      /* Fall through */
+    case CONST_VECTOR:
+	{
+	  rtx gen_insn = aarch64_simd_make_constant (x, true);
+	  /* Not a valid const vector.  */
+	  if (!gen_insn)
+	    break;
 
-      /* Fall through.  */
+	  switch (GET_CODE (gen_insn))
+	  {
+	  case CONST_VECTOR:
+	    /* Load using MOVI/MVNI.  */
+	    if (aarch64_simd_valid_immediate (x, NULL))
+	      *cost += extra_cost->vect.movi;
+	    else /* Load using constant pool.  */
+	      *cost += extra_cost->ldst.load;
+	    break;
+	  /* Load using a DUP.  */
+	  case VEC_DUPLICATE:
+	    *cost += extra_cost->vect.dup;
+	    break;
+	  default:
+	    *cost += extra_cost->ldst.load;
+	    break;
+	  }
+	  return true;
+	}

Here, CSE is passing in a PARALLEL or a CONST_VECTOR.  That rtx
then gets passed to aarch64_simd_make_constant.  We then switch
based on the result of aarch64_simd_make_constant, with a case
statement for VEC_DUPLICATE.  So the code is handling a case in
which aarch64_simd_make_constant converts a PARALLEL or a
CONST_VECTOR (passed by CSE) into a VEC_DUPLICATE.  For the reasons
above, that doesn't seem to be possible.  aarch64_simd_make_constant
would return duplicated constants as a CONST_VECTOR rather than
a VEC_DUPLICATE.

It sounds like you're talking about the separate top-level
VEC_DUPLICATE case, which is obviously OK/needed.

Maybe it would be better to turn it around and say: do you have
a case in which the nested VEC_DUPLICATE case above is reached?

>> This would be much simpler if we could call aarch64_simd_valid_immediate
>> and aarch64_simd_dup_constant directly from the rtx cost code,

BTW, I meant const_vec_duplicate_p here, sorry.

> Agreed... I tried to separate them before, but the logic was awkward to split and I thought it not worth the effort, so instead I just
> changed it to have a checking-only mode.
>
>> hence the
>> question about whether the PARALLEL stuff was really needed in practice.
>> 
>> >> > +	  default:
>> >> > +	    *cost += extra_cost->ldst.load;
>> >> > +	    break;
>> >> > +	  }
>> >> > +	  return true;
>> >> > +	}
>> >> > +    case VEC_CONCAT:
>> >> > +	/* depending on the operation, either DUP or INS.
>> >> > +	   For now, keep default costing.  */
>> >> > +	break;
>> >> > +    case VEC_DUPLICATE:
>> >> > +	*cost += extra_cost->vect.dup;
>> >> > +	return true;
>> >> > +    case VEC_SELECT:
>> >> > +	{
>> >> > +	  /* cost subreg of 0 as free, otherwise as DUP */
>> >> > +	  rtx op1 = XEXP (x, 1);
>> >> > +	  int nelts;
>> >> > +	  if ((op1 == const0_rtx && !BYTES_BIG_ENDIAN)
>> >> > +	      || (BYTES_BIG_ENDIAN
>> >> > +		  && GET_MODE_NUNITS (mode).is_constant(&nelts)
>> >> > +		  && INTVAL (op1) == nelts - 1))
>> >> > +	    ;
>> >> > +	  else if (vec_series_lowpart_p (mode, GET_MODE (op1), op1))
>> >> > +	    ;
>> >> > +	  else if (vec_series_highpart_p (mode, GET_MODE (op1), op1))
>> >> > +	  /* Selecting the high part is not technically free, but we lack
>> >> > +	     enough information to decide that here.  For instance selecting
>> >> > +	     the high-part of a vec_dup *is* free or to feed into any _high
>> >> > +	     instruction.   Both of which we can't really tell.  That said
>> >> > +	     have a better chance to optimize an dup vs multiple constants.  */
>> >> > +	    ;
>> >>
>> >> Not sure about this.  We already try to detect the latter case (_high
>> >> instructions) via aarch64_strip_extend_vec_half.  We might be missing
>> >> some cases, but that still feels like the right way to go IMO.
>> >
>> > That's a different problem from what I understand.  What this is
>> > trying to say is that if you have a vector [x y a b] and you need the
>> > vector [x y], you can use the top part of the original vector for this.
>> >
>> > This is an approximation, because something that can be created with a
>> > movi is probably cheaper to keep distinct if it's not going to be
>> > paired with a _high operation (since you will have a dup then).
>> >
>> > The problem is that the front end has already split the two vectors
>> > into [x y a b] and [x y].
>> > There's nothing else that tries to consolidate them back up if both survive.
>> >
>> > As a consequence of this, the testcase test0 is not handled optimally.
>> > It would instead create two vectors, both via movi 0x3, just one being
>> > 64 bits and one being 128 bits.
>> >
>> > So unless the cost of selecting it is cheaper than the movi, CSE will
>> > not consolidate the vectors, and because movis are so cheap, the only
>> > cost that worked was 0.  But increasing the cost of movis requires the
>> > costs of everything else to be increased too (including loads).
>> >
>> > I preferred to zero out the cost, because the worst that can happen is
>> > a dup instead of a movi, and at best a dup instead of a load from the
>> > constant pool (if the constant is complicated).
>> 
>> Hmm, will need to look at this more tomorrow.
>> 
>> >> Selecting the high part of a vec_dup should get folded into another
>> vec_dup.
>> >>
>> >> The lowpart bits look OK, but which paths call this function without
>> >> first simplifying the select to a subreg?  The subreg is now the
>> >> canonical form (thanks to r12-2288).
>> >
>> > The simplification will happen during folding in CSE or in combine.
>> > This costing happens before the folding, when CSE is trying to decide
>> > whether to undo the front end's lowering of constants.
>> >
>> > To do so it models the constants and the semantic operation required
>> > to extract them.  E.g. to get 2 out of [0 2 4 5] it would need a
>> > VEC_SELECT of index 1.  I don't treat the first element/bottom part
>> > specially here; costing-wise they would be the same.
>> 
>> But which code path creates the VEC_SELECT?  We don't need any context to
>> know that the VEC_SELECT is non-canonical.  It's obvious from the operands
>> of the VEC_SELECT in isolation.
>
> The non-canonical RTL is never generated.  I assume we're talking about the index-0 case here,
> since subregs can't select arbitrary elements (as I asked before).
>
> For the 0 case it's only temporarily modelled as such, to keep the CSE alternative costing simple.
> Currently it's just a for loop over i = 0 to vec_elems.

Ah, sorry, I see now that you're talking about the 1/2 patch.
I looked at this one first :-)

> When it comes time to generate the actual insn, fold_rtx is called, which folds the VEC_SELECT
> into a subreg.
>
> So it's never emitted into the instruction stream in its non-canonical form.
>
>> 
>> I'd just rather tackle this at source than try to get the cost code to handle
>> non-canonical rtl.
>
> If that's what is preferred I can change the CSE patch to generate a subreg for the 0 case, though I'm not sure I agree with it,
> as CSE is just trying to ask "what is the cost of selecting element 0 of this vector".  And as I mentioned before,
> it never emits the instruction unfolded.  This seems to me a more logical representation for costing.

I think it's better to cost what we intend to generate.  Otherwise each
target needs to handle both forms: “CSE asks about this, but actually
intends to generate that instead”.

> It's unfortunate, however, that there's only one costing callback; as far as CSE is concerned the representation/form
> doesn't matter, it's just looking at the high-level operation.
>
> Or is the concern here that most targets will have costing for subreg 0 but not VEC_SELECT?  In that case, without
> actually handling the costs of the other operations, the CSE changes won't do anything for those targets anyway.  And it would
> be odd for a target to cost VEC_SELECTs of indices 1 to <N> but not index 0.

Well, even for the motivating target (aarch64), we had to make changes
to treat index 0 as especially cheap.  That's likely to be necessary on
other targets too, if they want to take advantage of this.  The for
loop exists because the index matters.

I'm still a bit sceptical about treating the high-part cost as lower.
ISTM that the subreg cases are the ones that are truly “free” and any
others should have a normal cost.  So if CSE handled the subreg case
itself (to model how the rtx would actually be generated) then aarch64
code would have to do less work.  I imagine that will be true for other
targets as well.

Thanks,
Richard

