[PATCH 5/7 v6] vect: Support vector load/store with length in vectorizer

Kewen.Lin linkw@linux.ibm.com
Wed Jul 1 13:23:39 GMT 2020


Hi Richard,

Many thanks for your great review comments!

on 2020/7/1 3:53 AM, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
>> index 06a04e3d7dd..284c15705ea 100644
>> --- a/gcc/doc/invoke.texi
>> +++ b/gcc/doc/invoke.texi
>> @@ -13389,6 +13389,13 @@ by the copy loop headers pass.
>>  @item vect-epilogues-nomask
>>  Enable loop epilogue vectorization using smaller vector size.
>>  
>> +@item vect-with-length-scope
> 
> In principle there's nothing length-specific about this option.
> We could do the same for masks or for any future loop control
> mechanism.  So how about vect-partial-vector-usage instead?
> 

Sounds good, will update it. 

[snip] 

I will also address the comments in the snipped parts (where there are any);
I snipped some of them for better readability.

>> +      machine_mode vmode;
>> +      /* Check whether the related VnQI vector mode exists, as well as
>> +	 optab supported.  */
>> +      if (related_vector_mode (mode, emode, nunits).exists (&vmode)
>> +	  && direct_optab_handler (op, vmode))
>> +	{
>> +	  unsigned int mul;
>> +	  scalar_mode orig_emode = GET_MODE_INNER (mode);
>> +	  poly_uint64 orig_esize = GET_MODE_SIZE (orig_emode);
>> +
>> +	  if (constant_multiple_p (orig_esize, esize, &mul))
>> +	    *factor = mul;
>> +	  else
>> +	    gcc_unreachable ();
> 
> This is just:
> 
> 	  *factor = GET_MODE_UNIT_SIZE (mode);
> 
> However, I think it would be better to return the vector mode that the
> load or store should use, instead of this factor.  That way we can reuse
> it when generating the load and store statements.
> 
> So maybe call the function get_len_load_store_mode and return an
> opt_machine_mode.
> 

Will improve it.
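
A rough sketch of what I understand you to be suggesting (untested, and
assuming the len_load/len_store optabs introduced earlier in this series):

  /* Return the vector mode which can be used with length-based load/store
     for MODE: either MODE itself or the related same-sized VnQI mode,
     or nothing if neither is supported.  */

  static opt_machine_mode
  get_len_load_store_mode (machine_mode mode, bool is_load)
  {
    optab op = is_load ? len_load_optab : len_store_optab;
    gcc_assert (VECTOR_MODE_P (mode));

    /* Check whether length in lanes is supported for this mode directly.  */
    if (direct_optab_handler (op, mode))
      return mode;

    /* Check whether length in bytes is supported for the same-sized
       VnQImode (QImode elements, so nunits == size in bytes).  */
    machine_mode vmode;
    poly_uint64 nunits = GET_MODE_SIZE (mode);
    if (related_vector_mode (mode, QImode, nunits).exists (&vmode)
        && direct_optab_handler (op, vmode))
      return vmode;

    return opt_machine_mode ();
  }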

>> diff --git a/gcc/params.opt b/gcc/params.opt
>> index 9b564bb046c..daa6e8a2beb 100644
>> --- a/gcc/params.opt
>> +++ b/gcc/params.opt
>> @@ -968,4 +968,8 @@ Bound on number of runtime checks inserted by the vectorizer's loop versioning f
>>  Common Joined UInteger Var(param_vect_max_version_for_alignment_checks) Init(6) Param Optimization
>>  Bound on number of runtime checks inserted by the vectorizer's loop versioning for alignment check.
>>  
>> +-param=vect-with-length-scope=
>> +Common Joined UInteger Var(param_vect_with_length_scope) Init(0) IntegerRange(0, 2) Param Optimization
>> +Control the vector with length exploitation scope.
> 
> Think this should be a bit more descriptive, at least saying what the
> three values are (but in a more abbreviated form than the .texi above).
> 
> I think the default should be 2, with targets actively turning it down
> where necessary.  That way, the decision to turn it down is more likely
> to have a comment explaining why.
> 

Will update both.
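
For the params.opt entry, perhaps something along these lines (wording
still to be polished):

  -param=vect-partial-vector-usage=
  Common Joined UInteger Var(param_vect_partial_vector_usage) Init(2) IntegerRange(0, 2) Param Optimization
  Controls how the loop vectorizer uses partial vectors.  0 means never, 1 means only for loops whose need to iterate can be removed, 2 means for all loops.  The default value is 2.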

>> +
>>    tree ctrl_type = rgc->type;
>> -  unsigned int nscalars_per_iter = rgc->max_nscalars_per_iter;
>> +  /* Scale up nscalars per iteration with factor.  */
>> +  unsigned int nscalars_per_iter_ft = rgc->max_nscalars_per_iter * rgc->factor;
> 
> Maybe “scaled_nscalars_per_iter”?  Not sure the comment really adds
> anything here.
> 
> Or maybe “nitems_per_iter”, to keep the names shorter?
> 

Will use the short one.

>>    poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>> +  tree length_limit = NULL_TREE;
>> +  /* For length, we need length_limit to check length in range.  */
>> +  if (!vect_for_masking)
>> +    {
>> +      poly_uint64 len_limit = nscalars_per_ctrl * rgc->factor;
>> +      length_limit = build_int_cst (compare_type, len_limit);
>> +    }
>>  
>>    /* Calculate the maximum number of scalar values that the rgroup
>>       handles in total, the number that it handles for each iteration
>> @@ -434,12 +445,12 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>>    tree nscalars_total = niters;
>>    tree nscalars_step = build_int_cst (iv_type, vf);
>>    tree nscalars_skip = niters_skip;
>> -  if (nscalars_per_iter != 1)
>> +  if (nscalars_per_iter_ft != 1)
>>      {
>>        /* We checked before setting LOOP_VINFO_USING_PARTIAL_VECTORS_P that
>>  	 these multiplications don't overflow.  */
>> -      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter);
>> -      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter);
>> +      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter_ft);
>> +      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter_ft);
>>        nscalars_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
>>  				     nscalars_total, compare_factor);
>>        nscalars_step = gimple_build (preheader_seq, MULT_EXPR, iv_type,
>> @@ -509,7 +520,7 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>>  	     NSCALARS_SKIP to that cannot overflow.  */
>>  	  tree const_limit = build_int_cst (compare_type,
>>  					    LOOP_VINFO_VECT_FACTOR (loop_vinfo)
>> -					    * nscalars_per_iter);
>> +					    * nscalars_per_iter_ft);
>>  	  first_limit = gimple_build (preheader_seq, MIN_EXPR, compare_type,
>>  				      nscalars_total, const_limit);
>>  	  first_limit = gimple_build (preheader_seq, PLUS_EXPR, compare_type,
> 
> It looks odd that we don't need to adjust the other nscalars_* values too.
> E.g. the above seems to be comparing an unscaled nscalars_total with
> a scaled nscalars_per_iter.  I think the units ought to “agree”,
> both here and in the rest of the function.
> 

Sorry, I didn't quite follow this comment.  Both nscalars_total and
nscalars_step are scaled here.  The only remaining related nscalars_*
value seems to be nscalars_skip, but the length-based approach can't
support skipping.

>>  	}
>>  
>> +      /* First iteration is full.  */
> 
> This comment belongs inside the “if”.
> 

Sorry, I might be missing something, but doesn't this comment apply to both
branches?

>>        if (!init_ctrl)
>> -	/* First iteration is full.  */
>> -	init_ctrl = build_minus_one_cst (ctrl_type);
>> +	{
>> +	  if (vect_for_masking)
>> +	    init_ctrl = build_minus_one_cst (ctrl_type);
>> +	  else
>> +	    init_ctrl = length_limit;
>> +	}
>>  
>> […]
>> @@ -2568,7 +2608,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>>    if (vect_epilogues
>>        && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>>        && prolog_peeling >= 0
>> -      && known_eq (vf, lowest_vf))
>> +      && known_eq (vf, lowest_vf)
>> +      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (epilogue_vinfo))
>>      {
>>        unsigned HOST_WIDE_INT eiters
>>  	= (LOOP_VINFO_INT_NITERS (loop_vinfo)
> 
> I'm still not really convinced that this check is right.  It feels
> like it's hiding a problem elsewhere.
> 

The comment above this hunk says:

  /* If we know the number of scalar iterations for the main loop we should
     check whether after the main loop there are enough iterations left over
     for the epilogue.  */

So the check is for whether the entries in loop_vinfo->epilogue_vinfos can be
removed, and the main work in the loop is to remove unsuitable entries from
epilogue_vinfos.

To keep it simple, let's assume prolog_peeling and LOOP_VINFO_PEELING_FOR_GAPS
are zero and vf == lowest_vf.

   eiters = eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)

eiters is the number of remaining iterations that can't be handled in the
main loop with full (width/lanes) vectors.

For partial-vector epilogue handling, loop_vinfo->vector_mode and
epilogue_vinfo->vector_mode are the same (a special case).

      while (!(constant_multiple_p
	       (GET_MODE_SIZE (loop_vinfo->vector_mode),
		GET_MODE_SIZE (epilogue_vinfo->vector_mode), &ratio)
	       && eiters >= lowest_vf / ratio + epilogue_gaps))

It means the ratio is 1 (again a special case), lowest_vf/ratio is still vf,
and the remaining eiters is definitely less than vf, so
loop_vinfo->epilogue_vinfos[0] gets removed.
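
As a concrete (made-up) example: with niters = 20 and vf = lowest_vf = 16,
eiters = 20 % 16 = 4; since the vector modes match, ratio = 1, so the check
needs eiters >= 16 / 1 + 0 = 16, which fails, and the epilogue_vinfo is
removed.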

I think the reason the partial-vector epilogue is special here is that the VF
of the main loop is the same as the VF of the epilogue loop.  Normally the VF
of the epilogue loop is less than the VF of the main loop (the code here seems
to assume a multiple relationship between them).

>> […]
>> @@ -1072,6 +1074,88 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
>>    return true;
>>  }
>>  
>> +/* Check whether we can use vector access with length based on precison
>> +   comparison.  So far, to keep it simple, we only allow the case that the
>> +   precision of the target supported length is larger than the precision
>> +   required by loop niters.  */
>> +
>> +static bool
>> +vect_verify_loop_lens (loop_vec_info loop_vinfo)
>> +{
>> +  vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
>> +
>> +  if (LOOP_VINFO_LENS (loop_vinfo).is_empty ())
>> +    return false;
>> +
>> +  /* The one which has the largest NV should have max bytes per iter.  */
>> +  rgroup_controls *rgl = &(*lens)[lens->length () - 1];
> 
> “lens->last ()”.  Using a reference feels more natural here.
> 

Will fix it.

>> +
>> +  /* Work out how many bits we need to represent the length limit.  */
>> +  unsigned int nscalars_per_iter_ft = rgl->max_nscalars_per_iter * rgl->factor;
> 
> I think this breaks the abstraction.  There's no guarantee that the
> factor is the same for each rgroup_control, so there's no guarantee
> that the maximum bytes per iter comes the last entry.  (Also, it'd
> be better to avoid talking about bytes if we're trying to be general.)
> I think we should take the maximum of each entry instead.
> 

Agree!  I guess the above "maximum bytes per iter" is a typo and you meant
"maximum elements per iter"?  Yes, the current code works on lengths in bytes,
so checking the last entry is only reasonable for that case.  Will update it
to check all entries instead.
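
A sketch of taking the maximum over all entries (keeping the factor scaling
as above):

  unsigned int max_nitems_per_iter = 1;
  unsigned int i;
  rgroup_controls *rgl;
  /* Find the maximum number of items per iteration over all rgroups.  */
  FOR_EACH_VEC_ELT (*lens, i, rgl)
    {
      unsigned int nitems_per_iter = rgl->max_nscalars_per_iter * rgl->factor;
      max_nitems_per_iter = MAX (max_nitems_per_iter, nitems_per_iter);
    }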

>> +     we perfer to still use the niters type.  */
>> +  unsigned int ni_prec
>> +    = TYPE_PRECISION (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)));
>> +  /* Prefer to use Pmode and wider IV to avoid narrow conversions.  */
>> +  unsigned int pmode_prec = GET_MODE_BITSIZE (Pmode);
>> +
>> +  unsigned int required_prec = ni_prec;
>> +  if (required_prec < pmode_prec)
>> +    required_prec = pmode_prec;
>> +
>> +  tree iv_type = NULL_TREE;
>> +  if (min_ni_prec > required_prec)
>> +    {
> 
> Do we need this condition?  Looks like we could just do:
> 
>   min_ni_prec = MAX (min_ni_prec, GET_MODE_BITSIZE (Pmode));
>   min_ni_prec = MAX (min_ni_prec, ni_prec);
> 
> and then run the loop below.
> 

I think the assumption holds that Pmode and the niters type are standard
integral types.  If so, neither of them needs the loop below to build an
integer type, but min_ni_prec does.  Does it make sense to differentiate them?
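
If I followed the suggestion literally, the code would look something like
the sketch below; my question is whether running the mode loop is wasted
effort when min_ni_prec ends up being the Pmode or niters precision, since
those already correspond to standard integer types:

  min_ni_prec = MAX (min_ni_prec, pmode_prec);
  min_ni_prec = MAX (min_ni_prec, ni_prec);

  tree iv_type = NULL_TREE;
  opt_scalar_int_mode tmode_iter;
  FOR_EACH_MODE_IN_CLASS (tmode_iter, MODE_INT)
    {
      scalar_int_mode tmode = tmode_iter.require ();
      unsigned int tbits = GET_MODE_BITSIZE (tmode);

      /* Use the first integer mode that is wide enough and supported.  */
      if (tbits >= min_ni_prec
          && targetm.scalar_mode_supported_p (tmode))
        {
          iv_type = build_nonstandard_integer_type (tbits, true);
          break;
        }
    }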

>> +      /* Decide whether to use fully-masked approach.  */
>> +      if (vect_verify_full_masking (loop_vinfo))
>> +	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
>> +      /* Decide whether to use length-based approach.  */
>> +      else if (vect_verify_loop_lens (loop_vinfo))
>> +	{
>> +	  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>> +	      || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
>> +	    {
>> +	      if (dump_enabled_p ())
>> +		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> +				 "can't vectorize this loop with length-based"
>> +				 " partial vectors approach becuase peeling"
>> +				 " for alignment or gaps is required.\n");
>> +	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
>> +	    }
> 
> Why are these peeling cases necessary?  Peeling for gaps should
> just mean subtracting one scalar iteration from the iteration count
> and shouldn't otherwise affect the main loop.  Similarly, peeling for
> alignment can be handled in the normal way, with a scalar prologue loop.
> 

I was thinking of relaxing this later, to avoid handling too many cases in
the first enablement patch.  Since the Power hardware levels that support
vector with length also support unaligned load/store, I'd need to construct
some test cases for these paths.  May I postpone it a bit?  Or would you
prefer me to support it here?

>> +	  else if (param_vect_with_length_scope == 0)
>> +	    LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
> 
> As above, I don't think this should be length-specific.  Same for the
> == 1 handling, which we could do afterwards.
> 

OK.  Will adjust it.

[snip]
>> +
>> +      if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
>> +	dump_printf_loc (MSG_NOTE, vect_location,
>> +			 "using length-based partial"
>> +			 " vectors for loop fully.\n");
>> +      else
>> +	dump_printf_loc (MSG_NOTE, vect_location,
>> +			 "not using length-based partial"
>> +			 " vectors for loop fully.\n");
> 
> Think just one message for all three cases is better, perhaps with
> 
>   "operating only on full vectors.\n"
> 
> instead of "not using a fully-masked loop.\n".  Might need some
> testsuite updates though -- probably worth splitting the wording
> change out into a separate patch if so.
> 

OK, will check and split it if needed.

[snip]

>> +
>> +void
>> +vect_record_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
>> +		      unsigned int nvectors, tree vectype, unsigned int factor)
>> +{
>> +  gcc_assert (nvectors != 0);
>> +  if (lens->length () < nvectors)
>> +    lens->safe_grow_cleared (nvectors);
>> +  rgroup_controls *rgl = &(*lens)[nvectors - 1];
>> +
>> +  /* The number of scalars per iteration, scalar occupied bytes and
>> +     the number of vectors are both compile-time constants.  */
>> +  unsigned int nscalars_per_iter
>> +    = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
>> +		 LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
>> +
>> +  if (rgl->max_nscalars_per_iter < nscalars_per_iter)
>> +    {
>> +      rgl->max_nscalars_per_iter = nscalars_per_iter;
>> +      rgl->type = vectype;
>> +      rgl->factor = factor;
>> +    }
> 
> This is dangerous because it ignores “factor” otherwise, and ignores
> the previous factor if we overwrite it.
> 
> I think instead we should have:
> 
>   /* For now, we only support cases in which all loads and stores fall back
>      to VnQI or none do.  */
>   gcc_assert (!rgl->max_nscalars_per_iter
> 	      || (rgl->factor == 1 && factor == 1)
> 	      || (rgl->max_nscalars_per_iter * rgl->factor
> 		  == nscalars_per_iter * factor));
> 
> before changing rgl.
> 

Thanks for pointing this out!  Will guard it.

[snip]
>>  	      if (vec_mask)
>>  		final_mask = prepare_load_store_mask (mask_vectype, final_mask,
>>  						      vec_mask, gsi);
>> @@ -7994,6 +8057,34 @@ vectorizable_store (vec_info *vinfo,
>>  		  vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
>>  		  new_stmt = call;
>>  		}
>> +	      else if (final_len)
>> +		{
>> +		  align = least_bit_hwi (misalign | align);
>> +		  tree ptr = build_int_cst (ref_type, align);
>> +		  tree vtype = TREE_TYPE (vec_oprnd);
> 
> Couldn't you just reuse “vectype”?  Worth a comment if not.
> 

Yeah, will replace it with vectype.

[snip]

>> @@ -9850,11 +9986,30 @@ vectorizable_condition (vec_info *vinfo,
>>  	  return false;
>>  	}
>>  
>> -      if (loop_vinfo
>> -	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
>> -	  && reduction_type == EXTRACT_LAST_REDUCTION)
>> -	vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
>> -			       ncopies * vec_num, vectype, NULL);
>> +      if (loop_vinfo && for_reduction
>> +	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
>> +	{
>> +	  if (reduction_type == EXTRACT_LAST_REDUCTION)
>> +	    vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
>> +				   ncopies * vec_num, vectype, NULL);
>> +	  /* Using partial vectors can introduce inactive lanes in the last
>> +	     iteration, since full vector of condition results are operated,
>> +	     it's unsafe here.  But if we can AND the condition mask with
>> +	     loop mask, it would be safe then.  */
>> +	  else if (!loop_vinfo->scalar_cond_masked_set.is_empty ())
>> +	    {
>> +	      scalar_cond_masked_key cond (cond_expr, ncopies * vec_num);
>> +	      if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
>> +		{
>> +		  bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
>> +		  cond.code = invert_tree_comparison (cond.code, honor_nans);
>> +		  if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
>> +		    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>> +		}
>> +	    }
>> +	  else
>> +	    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>> +	}
>>  
>>        STMT_VINFO_TYPE (stmt_info) = condition_vec_info_type;
>>        vect_model_simple_cost (vinfo, stmt_info, ncopies, dts, ndts, slp_node,
> 
> I don't understand this part.

This is for the regression case on aarch64:

PASS->FAIL: gcc.target/aarch64/sve/reduc_8.c -march=armv8.2-a+sve  scan-assembler-not \\tcmpeq\\tp[0-9]+\\.s,

As you mentioned before, we would expect to record masks for partial-vector
reductions, since otherwise the inactive lanes could be unsafe.  For this
failing case, the reduction_type is TREE_CODE_REDUCTION, so we don't record a
loop mask.  But it's still safe, since the mask is further ANDed with a loop
mask.  The difference looks like:

Without the mask AND loop mask optimization:

  loop_mask = ...
  v1 = .MASK_LOAD (a, loop_mask)
  mask1 = v1 == {cst, ...}                // unsafe since it's generated from the full width
  mask2 = loop_mask & mask1               // safe, since it's ANDed with the loop mask?
  v2 = .MASK_LOAD (b, mask2)
  vres = VEC_COND_EXPR <mask1, vres, v2>  // unsafe because of mask1

With the mask AND loop mask optimization:

  loop_mask = ...
  v1 = .MASK_LOAD (a, loop_mask)
  mask1 = v1 == {cst, ...}
  mask2 = loop_mask & mask1
  v2 = .MASK_LOAD (b, mask2)
  vres = VEC_COND_EXPR <mask2, vres, v2>  // safe because of mask2?


ANDing with the loop mask can make unsafe inactive lanes safe.  So the fix
here is to check whether that further optimization is possible; if it is, we
know the condition is safe.  Does that make sense?

> 
>> @@ -11910,3 +12065,36 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
>>    *nunits_vectype_out = nunits_vectype;
>>    return opt_result::success ();
>>  }
>> +
>> +/* Generate and return statement sequence that sets vector length LEN that is:
>> +
>> +   min_of_start_and_end = min (START_INDEX, END_INDEX);
>> +   left_len = END_INDEX - min_of_start_and_end;
>> +   rhs = min (left_len, LEN_LIMIT);
>> +   LEN = rhs;
>> +
>> +   TODO: for now, rs6000 supported vector with length only cares 8-bits, which
>> +   means if we have left_len in bytes larger than 255, it can't be saturated to
>> +   vector limit (vector size).  One target hook can be provided if other ports
>> +   don't suffer this.
>> +*/
> 
> Should be no line break before the */
> 
> Personally I think it'd be better to drop the TODO.  This isn't the only
> place that would need to change if we allowed out-of-range lengths,
> whereas the comment might give the impression that it is.
> 

Sorry, I might be missing something, but all undetermined lengths are
generated here.  Did you mean the documentation by the other places, or
something else?

>> +
>> +gimple_seq
>> +vect_gen_len (tree len, tree start_index, tree end_index, tree len_limit)
>> +{
>> +  gimple_seq stmts = NULL;
>> +  tree len_type = TREE_TYPE (len);
>> +  gcc_assert (TREE_TYPE (start_index) == len_type);
>> +
>> +  tree min = fold_build2 (MIN_EXPR, len_type, start_index, end_index);
>> +  tree left_len = fold_build2 (MINUS_EXPR, len_type, end_index, min);
>> +  left_len = fold_build2 (MIN_EXPR, len_type, left_len, len_limit);
>> +
>> +  tree rhs = force_gimple_operand (left_len, &stmts, true, NULL_TREE);
>> +  gimple *new_stmt = gimple_build_assign (len, rhs);
>> +  gimple_stmt_iterator i = gsi_last (stmts);
>> +  gsi_insert_after_without_update (&i, new_stmt, GSI_CONTINUE_LINKING);
>> +
>> +  return stmts;
>> +}
> 
> It's better to build this up using gimple_build instead.
> 

Will fix it.
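
Presumably along the lines of this quick sketch (assuming the gimple_build
overloads from gimple-fold.h):

  gimple_seq
  vect_gen_len (tree len, tree start_index, tree end_index, tree len_limit)
  {
    gimple_seq stmts = NULL;
    tree len_type = TREE_TYPE (len);
    gcc_assert (TREE_TYPE (start_index) == len_type);

    /* min_of_start_and_end = min (START_INDEX, END_INDEX);  */
    tree min = gimple_build (&stmts, MIN_EXPR, len_type, start_index,
                             end_index);
    /* left_len = END_INDEX - min_of_start_and_end;  */
    tree left_len = gimple_build (&stmts, MINUS_EXPR, len_type, end_index,
                                  min);
    /* rhs = min (left_len, LEN_LIMIT);  */
    left_len = gimple_build (&stmts, MIN_EXPR, len_type, left_len,
                             len_limit);

    /* LEN = rhs;  */
    gimple *new_stmt = gimple_build_assign (len, left_len);
    gimple_seq_add_stmt (&stmts, new_stmt);

    return stmts;
  }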

[snip]

>> +  bool epil_using_partial_vectors_p;
>> +
>>    /* When we have grouped data accesses with gaps, we may introduce invalid
>>       memory accesses.  We peel the last iteration of the loop to prevent
>>       this.  */
> 
> Thanks,
> Richard
> 


BR,
Kewen

