[vect] Re-analyze all modes for epilogues

Tue Dec 14 11:39:18 GMT 2021

"Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes:
> Hi,
>
> Added an extra step to skip unusable epilogue modes when we know the 
> target does not support predication. This uses a new function 
> 'support_predication_p' that is generated at build time and checks 
> whether the target supports at least one optab that can be used for 
> predicated code-generation.
>
> Bootstrapped and regression tested on aarch64-none-linux-gnu.
>
> OK for trunk?

Looks good, but see the final comment below about whether we could
simplify this a bit.

> gcc/ChangeLog:
>
>          * tree-vect-loop.c (vect_better_loop_vinfo_p): Round factors up 
> for epilogue costing.
>          (vect_analyze_loop): Re-analyze all modes for epilogues, unless 
> we are guaranteed that no
>          predication is possible.
>          (genopinit.c) (support_predication_p): Generate new function.
>
> gcc/testsuite/ChangeLog:
>
>          * gcc.target/aarch64/masked_epilogue.c: New test.
>
> diff --git a/gcc/genopinit.c b/gcc/genopinit.c
> index 195ddf74fa2b7d89760622073dcec9d5d339a097..e0958bc6c849911395341611a53b0fcb69565827 100644
> --- a/gcc/genopinit.c
> +++ b/gcc/genopinit.c
> @@ -321,6 +321,7 @@ main (int argc, const char **argv)
>  	   "  bool supports_vec_scatter_store_cached;\n"
>  	   "};\n"
>  	   "extern void init_all_optabs (struct target_optabs *);\n"
> +	   "extern bool support_predication_p (void);\n"

len_load and len_store aren't really predication (or masking).
So maybe:

  partial_vectors_supported_p

?

>  	   "\n"
>  	   "extern struct target_optabs default_target_optabs;\n"
>  	   "extern struct target_optabs *this_fn_optabs;\n"
> @@ -373,6 +374,33 @@ main (int argc, const char **argv)
>      fprintf (s_file, "  ena[%u] = HAVE_%s;\n", i, p->name);
>    fprintf (s_file, "}\n\n");
>  
> +  fprintf (s_file,
> +	   "/* Returns TRUE if the target supports any of the predication\n"
> +	   "   specific optabs: while_ult_optab, len_load_optab or len_store_optab,\n"
> +	   "   for any mode.  */\n"

Similarly here, s/predication specific optabs/partial vector optabs/.

> +	   "bool\nsupport_predication_p (void)\n{\n");
> +  bool any_match = false;
> +  fprintf (s_file, "\treturn");
> +  bool first = true;
> +  for (i = 0; patterns.iterate (i, &p); ++i)
> +    {
> +#define CMP_NAME(N) !strncmp (p->name, (N), strlen ((N)))
> +      if (CMP_NAME("while_ult") || CMP_NAME ("len_load")
> +	  || CMP_NAME ("len_store"))
> +	{
> +	  if (first)
> +	    fprintf (s_file, " HAVE_%s", p->name);
> +	  else
> +	    fprintf (s_file, " || HAVE_%s", p->name);
> +	  first = false;
> +	  any_match = true;
> +	}
> +    }
> +  if (!any_match)
> +    fprintf (s_file, " false");
> +  fprintf (s_file, ";\n}\n");
> +
> +
>    /* Perform a binary search on a pre-encoded optab+mode*2.  */
>    /* ??? Perhaps even better to generate a minimal perfect hash.
>       Using gperf directly is awkward since it's so geared to working
> diff --git a/gcc/testsuite/gcc.target/aarch64/masked_epilogue.c b/gcc/testsuite/gcc.target/aarch64/masked_epilogue.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..286a7be236f337fee4c4650f42da72000855c5e6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/masked_epilogue.c
> @@ -0,0 +1,10 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details -march=armv8-a+sve -msve-vector-bits=scalable" } */
> +
> +void f(unsigned char y[restrict],
> +       unsigned char x[restrict], int n) {
> +  for (int i = 0; i < n; ++i)
> +    y[i] = (y[i] + x[i] + 1) >> 1;
> +}
> +
> +/* { dg-final { scan-tree-dump {LOOP EPILOGUE VECTORIZED \(MODE=VNx} "vect" } } */
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index a28bb6321d76b8222bc8cfdade151ca9b4dca406..86e0cb47aef2919fdf7d87228f7f6a8378893e68 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -2824,11 +2824,13 @@ vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo,
>  	{
>  	  unsigned HOST_WIDE_INT main_vf_max
>  	    = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX);
> +	  unsigned HOST_WIDE_INT old_vf_max
> +	    = estimated_poly_value (old_vf, POLY_VALUE_MAX);
> +	  unsigned HOST_WIDE_INT new_vf_max
> +	    = estimated_poly_value (new_vf, POLY_VALUE_MAX);
>  
> -	  old_factor = main_vf_max / estimated_poly_value (old_vf,
> -							   POLY_VALUE_MAX);
> -	  new_factor = main_vf_max / estimated_poly_value (new_vf,
> -							   POLY_VALUE_MAX);
> +	  old_factor = CEIL (main_vf_max, old_vf_max);
> +	  new_factor = CEIL (main_vf_max, new_vf_max);
>  
>  	  /* If the loop is not using partial vectors then it will iterate one
>  	     time less than one that does.  It is safe to subtract one here,
> @@ -3069,8 +3071,6 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
>    machine_mode autodetected_vector_mode = VOIDmode;
>    opt_loop_vec_info first_loop_vinfo = opt_loop_vec_info::success (NULL);
>    unsigned int mode_i = 0;
> -  unsigned int first_loop_i = 0;
> -  unsigned int first_loop_next_i = 0;
>    unsigned HOST_WIDE_INT simdlen = loop->simdlen;
>  
>    /* First determine the main loop vectorization mode, either the first
> @@ -3079,7 +3079,6 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
>       lowest cost if pick_lowest_cost_p.  */
>    while (1)
>      {
> -      unsigned int loop_vinfo_i = mode_i;
>        bool fatal;
>        opt_loop_vec_info loop_vinfo
>  	= vect_analyze_loop_1 (loop, shared, &loop_form_info,
> @@ -3108,11 +3107,7 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
>  	      first_loop_vinfo = opt_loop_vec_info::success (NULL);
>  	    }
>  	  if (first_loop_vinfo == NULL)
> -	    {
> -	      first_loop_vinfo = loop_vinfo;
> -	      first_loop_i = loop_vinfo_i;
> -	      first_loop_next_i = mode_i;
> -	    }
> +	    first_loop_vinfo = loop_vinfo;
>  	  else
>  	    {
>  	      delete loop_vinfo;
> @@ -3158,30 +3153,40 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
>    /* Now analyze first_loop_vinfo for epilogue vectorization.  */
>    poly_uint64 lowest_th = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo);
>  
> -  /* Handle the case that the original loop can use partial
> -     vectorization, but want to only adopt it for the epilogue.
> -     The retry should be in the same mode as original.  */
> -  if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo))
> -    {
> -      gcc_assert (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (first_loop_vinfo)
> -		  && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (first_loop_vinfo));
> -      if (dump_enabled_p ())
> -	dump_printf_loc (MSG_NOTE, vect_location,
> -			 "***** Re-trying analysis with same vector mode"
> -			 " %s for epilogue with partial vectors.\n",
> -			 GET_MODE_NAME (first_loop_vinfo->vector_mode));
> -      mode_i = first_loop_i;
> -    }
> -  else
> -    {
> -      mode_i = first_loop_next_i;
> -      if (mode_i == vector_modes.length ())
> -	return first_loop_vinfo;
> -    }
> +  /* For epilogues start the analysis from the first mode.  The motivation
> +     behind starting from the beginning comes from cases where the VECTOR_MODES
> +     array may contain length agnostic and length fixed modes.  Their ordering

length-agnostic and length-specific

> +     is not guaranteed, so we could end up picking a mode for the main loop
> +     that is after the epilogue's optimal mode.  */

That's true, and good enough on its own, but FTR: this could also happen
for non-SVE cases.  E.g. the user could use simdlen to force a particular
loop to use long vectors even when the current target prefers shorter
vectors over longer vectors.  The loop could still benefit from using
the higher-priority shorter vectors for the epilogue.

No need to change the comment to say that.  Just thought it was worth
saying here for the record.

> +  mode_i = 0;

I think this should be 1 if vector_modes.length () > 1.  The documentation
for autovectorize_vector_modes says:

  The first mode should be the @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}
  for its element mode.

(although that probably amounts to a self-quote :-/).  That should
help with the change below.

> +
> +  /* If the target does not support predication we can shorten the number of
> +     modes to analyze for the epilogue as we know we can't pick a mode that has
> +     at least as many NUNITS as the main loop's vectorization factor, since
> +     that would imply the epilogue's vectorization factor would be at least as
> +     high as the main loop's.  */
> +  poly_uint64 first_vinfo_vf = LOOP_VINFO_VECT_FACTOR (first_loop_vinfo);
> +  if (!support_predication_p ())
> +    {
> +      machine_mode epil_mode = mode_i == 0
> +			       ? autodetected_vector_mode
> +			       : vector_modes[mode_i];
> +      while (maybe_ge (GET_MODE_NUNITS (epil_mode), first_vinfo_vf)
> +	     && mode_i < (vector_modes.length() - 1))
> +	epil_mode = vector_modes[++mode_i];
> +    }
> +
> +  if (dump_enabled_p () && mode_i != 0)
> +    dump_printf_loc (MSG_NOTE, vect_location,
> +		     "***** Re-trying analysis for epilogue using mode %s\n.",
> +		     GET_MODE_NAME (vector_modes[mode_i]));
> +  else if (dump_enabled_p ())
> +    dump_printf_loc (MSG_NOTE, vect_location,
> +		     "***** Re-trying analysis for epilogue using mode"
> +		     " autodetection vector mode.\n");
>    /* ???  If first_loop_vinfo was using VOIDmode then we probably
>       want to instead search for the corresponding mode in vector_modes[].  */
> -

Could we just stick a continue in the loop below instead?  That seems
like it could be simpler and more general, in that it would skip any
mode that is too big, not just leading modes.

It's probably worth computing support_predication_p () outside the
loop though.

I think this patch removes the need for the ???.  We're doing the
kind of skip that the ??? anticipated.

Thanks,
Richard

>    while (1)
>      {
>        bool fatal;