This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Re: [RFC][PR88838][SVE] Use 32-bit WHILELO in LP64 mode


Kugan Vivekanandarajah <kugan.vivekanandarajah@linaro.org> writes:
> diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> index 77d3dac..d6452a1 100644
> --- a/gcc/tree-vect-loop-manip.c
> +++ b/gcc/tree-vect-loop-manip.c
> @@ -418,7 +418,20 @@ vect_set_loop_masks_directly (struct loop *loop, loop_vec_info loop_vinfo,
>    tree mask_type = rgm->mask_type;
>    unsigned int nscalars_per_iter = rgm->max_nscalars_per_iter;
>    poly_uint64 nscalars_per_mask = TYPE_VECTOR_SUBPARTS (mask_type);
> -
> +  bool convert = false;
> +  tree iv_type = NULL_TREE;
> +
> +  /* If compare_type is narrower than Pmode, create the IV in Pmode and
> +     truncate each use (i.e. convert it to the required type).  Using
> +     Pmode allows ivopts to reuse the IV for the indices in the loads
> +     and stores.  */
> +  if (known_lt (GET_MODE_BITSIZE (TYPE_MODE (compare_type)),
> +		GET_MODE_BITSIZE (Pmode)))
> +    {
> +      iv_type = build_nonstandard_integer_type (GET_MODE_BITSIZE (Pmode),
> +						true);
> +      convert = true;
> +    }
>    /* Calculate the maximum number of scalar values that the rgroup
>       handles in total, the number that it handles for each iteration
>       of the vector loop, and the number that it should skip during the
> @@ -444,12 +457,43 @@ vect_set_loop_masks_directly (struct loop *loop, loop_vec_info loop_vinfo,
>       processed.  */
>    tree index_before_incr, index_after_incr;
>    gimple_stmt_iterator incr_gsi;
> +  gimple_stmt_iterator incr_gsi2;
>    bool insert_after;
> -  tree zero_index = build_int_cst (compare_type, 0);
> +  tree zero_index;
>    standard_iv_increment_position (loop, &incr_gsi, &insert_after);
> -  create_iv (zero_index, nscalars_step, NULL_TREE, loop, &incr_gsi,
> -	     insert_after, &index_before_incr, &index_after_incr);
>  
> +  if (convert)
> +    {
> +      /* Create the IV in the wider iv_type and truncate its uses.  */
> +      zero_index = build_int_cst (iv_type, 0);
> +      tree step = build_int_cst (iv_type,
> +				 LOOP_VINFO_VECT_FACTOR (loop_vinfo));
> +      /* Create the Pmode-sized IV.  */
> +      create_iv (zero_index, step, NULL_TREE, loop, &incr_gsi,
> +		 insert_after, &index_before_incr, &index_after_incr);
> +      /* Create truncated copies of the index before and after increment.  */
> +      tree index_before_incr_trunc = make_ssa_name (compare_type);
> +      tree index_after_incr_trunc = make_ssa_name (compare_type);
> +      gimple *incr_before_stmt = gimple_build_assign (index_before_incr_trunc,
> +						      NOP_EXPR,
> +						      index_before_incr);
> +      gimple *incr_after_stmt = gimple_build_assign (index_after_incr_trunc,
> +						     NOP_EXPR,
> +						     index_after_incr);
> +      incr_gsi2 = incr_gsi;
> +      gsi_insert_before (&incr_gsi2, incr_before_stmt, GSI_NEW_STMT);
> +      gsi_insert_after (&incr_gsi, incr_after_stmt, GSI_NEW_STMT);
> +      index_before_incr = index_before_incr_trunc;
> +      index_after_incr = index_after_incr_trunc;
> +      zero_index = build_int_cst (compare_type, 0);
> +    }
> +  else
> +    {
> +      /* compare_type is at least as wide as Pmode, so no conversion is needed.  */
> +      zero_index = build_int_cst (compare_type, 0);
> +      create_iv (zero_index, nscalars_step, NULL_TREE, loop, &incr_gsi,
> +		 insert_after, &index_before_incr, &index_after_incr);
> +    }
>    tree test_index, test_limit, first_limit;
>    gimple_stmt_iterator *test_gsi;
>    if (might_wrap_p)

Instead of hard-coding Pmode as a special case here, I think we should
record the IV type in vect_verify_full_masking in addition to the comparison
type.  (With the IV type always being at least as wide as the comparison
type.)
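
Something like this, as an untested sketch (LOOP_VINFO_MASK_IV_TYPE and
the underlying field are invented for illustration; the real patch would
add them to _loop_vec_info):

  /* In vect_verify_full_masking, once cmp_type has been chosen:  */
  LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo) = cmp_type;
  /* Record a Pmode-sized IV type when the comparison type is narrower,
     so that ivopts can still reuse the IV in address calculations.  */
  if (known_lt (GET_MODE_BITSIZE (TYPE_MODE (cmp_type)),
                GET_MODE_BITSIZE (Pmode)))
    LOOP_VINFO_MASK_IV_TYPE (loop_vinfo)
      = build_nonstandard_integer_type (GET_MODE_BITSIZE (Pmode), true);
  else
    LOOP_VINFO_MASK_IV_TYPE (loop_vinfo) = cmp_type;

vect_set_loop_masks_directly could then create the IV in the recorded
type and emit the conversions only when it differs from the compare
type, without referring to Pmode itself.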

> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index bd81193..2769c86 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -1035,6 +1035,30 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
>    /* Find a scalar mode for which WHILE_ULT is supported.  */
>    opt_scalar_int_mode cmp_mode_iter;
>    tree cmp_type = NULL_TREE;
> +  tree niters_type = TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo));
> +  tree niters_skip = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
> +  unsigned HOST_WIDE_INT max_vf = vect_max_vf (loop_vinfo);
> +  widest_int iv_limit;
> +  bool known_max_iters = max_loop_iterations (loop, &iv_limit);
> +  if (known_max_iters)
> +    {
> +      if (niters_skip)
> +	{
> +	  /* Add the maximum number of skipped iterations to the
> +	     maximum iteration count.  */
> +	  if (TREE_CODE (niters_skip) == INTEGER_CST)
> +	    iv_limit += wi::to_widest (niters_skip);
> +	  else
> +	    iv_limit += max_vf - 1;
> +	}
> +      /* IV_LIMIT is the maximum number of latch iterations, which is also
> +	 the maximum in-range IV value.  Round this value down to the previous
> +	 vector alignment boundary and then add an extra full iteration.  */
> +      poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> +      iv_limit = (iv_limit & -(int) known_alignment (vf)) + max_vf;
> +    }
> +
> +  /* Get the vectorization factor in tree form.  */

Please split the loop-manip.c bits you need out into a subroutine instead
of cut-&-pasting. :-)
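
E.g. something like this, with the helper declared in tree-vectorizer.h
so that both files can use it (the name is just a suggestion; the body
lifts your vect_verify_full_masking hunk unchanged):

  /* Return the maximum value that a zero-based IV for the masks of
     LOOP_VINFO can reach, or -1 if that value isn't known.  */

  widest_int
  vect_iv_limit_for_full_masking (loop_vec_info loop_vinfo)
  {
    tree niters_skip = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
    unsigned HOST_WIDE_INT max_vf = vect_max_vf (loop_vinfo);
    widest_int iv_limit = -1;
    if (max_loop_iterations (LOOP_VINFO_LOOP (loop_vinfo), &iv_limit))
      {
        if (niters_skip)
          {
            /* Add the maximum number of skipped iterations to the
               maximum iteration count.  */
            if (TREE_CODE (niters_skip) == INTEGER_CST)
              iv_limit += wi::to_widest (niters_skip);
            else
              iv_limit += max_vf - 1;
          }
        /* IV_LIMIT is the maximum number of latch iterations, which is
           also the maximum in-range IV value.  Round this value down to
           the previous vector alignment boundary and then add an extra
           full iteration.  */
        poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
        iv_limit = (iv_limit & -(int) known_alignment (vf)) + max_vf;
      }
    return iv_limit;
  }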

>    FOR_EACH_MODE_IN_CLASS (cmp_mode_iter, MODE_INT)
>      {
>        unsigned int cmp_bits = GET_MODE_BITSIZE (cmp_mode_iter.require ());
> @@ -1045,12 +1069,23 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
>  	  if (this_type
>  	      && can_produce_all_loop_masks_p (loop_vinfo, this_type))
>  	    {
> +	      /* See whether zero-based IV would ever generate all-false masks
> +		 before wrapping around.  */
> +	      bool might_wrap_p
> +		= (!known_max_iters
> +		   || (wi::min_precision
> +		       (iv_limit
> +			* vect_get_max_nscalars_per_iter (loop_vinfo),
> +			UNSIGNED) > cmp_bits));

The wi::min_precision value is invariant, so there's no need to calculate
it in each iteration of the loop.  It would also be good to avoid the
duplicated call to vect_get_max_nscalars_per_iter (which is already
called for min_ni_width).
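
I.e. hoist the calculation above the FOR_EACH_MODE_IN_CLASS loop,
something like (reusing the hypothetical helper from above):

  unsigned int max_nscalars_per_iter
    = vect_get_max_nscalars_per_iter (loop_vinfo);
  widest_int iv_limit = vect_iv_limit_for_full_masking (loop_vinfo);
  unsigned int iv_precision = UINT_MAX;
  if (iv_limit != -1)
    iv_precision = wi::min_precision (iv_limit * max_nscalars_per_iter,
                                      UNSIGNED);

so that the per-mode test reduces to:

              bool might_wrap_p = (iv_precision > cmp_bits);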

>  	      /* Although we could stop as soon as we find a valid mode,
>  		 it's often better to continue until we hit Pmode, since the
>  		 operands to the WHILE are more likely to be reusable in
> -		 address calculations.  */
> +		 address calculations.  Unless the limit is extended from
> +		 this_type.  */

Please rewrite the comment to describe the new code rather than tacking
this kind of thing on the end.
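
E.g. (just to sketch the direction, not the exact wording):

              /* Although we could stop as soon as we find a valid mode,
                 it's often better to continue until we hit Pmode, since
                 the operands to the WHILE are more likely to be reusable
                 in address calculations.  Once we know the IV cannot
                 wrap in this_type, though, a wider type gains nothing.  */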

>  	      cmp_type = this_type;
> -	      if (cmp_bits >= GET_MODE_BITSIZE (Pmode))
> +	      if (cmp_bits >= GET_MODE_BITSIZE (Pmode)
> +		  || (!might_wrap_p
> +		      && (cmp_bits == TYPE_PRECISION (niters_type))))

The TYPE_PRECISION test looks redundant.  E.g. if a loop only executes
N times, all we care about is whether the IV we pick can handle N
iterations without wrapping.  It doesn't really matter what type the
original source code used to hold the loop count.
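
So the exit condition could simply become something like:

              cmp_type = this_type;
              if (cmp_bits >= GET_MODE_BITSIZE (Pmode)
                  || !might_wrap_p)
                break;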

Thanks,
Richard

