[PATCH] Fix PR82255 (vectorizer cost model overcounts some vector load costs)

Richard Sandiford richard.sandiford@linaro.org
Wed Sep 20 08:49:00 GMT 2017


Bill Schmidt <wschmidt@linux.vnet.ibm.com> writes:
> On Sep 19, 2017, at 4:21 PM, Bill Schmidt <wschmidt@linux.vnet.ibm.com> wrote:
>> On Sep 19, 2017, at 3:58 PM, Richard Sandiford
>> <richard.sandiford@linaro.org> wrote:
>>> Bill Schmidt <wschmidt@linux.vnet.ibm.com> writes:
>>>> Index: gcc/tree-vect-stmts.c
>>>> ===================================================================
>>>> --- gcc/tree-vect-stmts.c	(revision 252760)
>>>> +++ gcc/tree-vect-stmts.c	(working copy)
>>>> @@ -1091,8 +1091,19 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
>>>> 			prologue_cost_vec, body_cost_vec, true);
>>>>  if (memory_access_type == VMAT_ELEMENTWISE
>>>>      || memory_access_type == VMAT_STRIDED_SLP)
>>>> -    inside_cost += record_stmt_cost (body_cost_vec, ncopies, vec_construct,
>>>> -				     stmt_info, 0, vect_body);
>>>> +    {
>>>> +      int group_size = GROUP_SIZE (stmt_info);
>>>> +      int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
>>>> +      if (group_size < nunits)
>>>> +	{
>>>> +	  if (dump_enabled_p ())
>>>> +	    dump_printf_loc (MSG_NOTE, vect_location,
>>>> +			     "vect_model_load_cost: vec_construct required");
>>>> +	  inside_cost += record_stmt_cost (body_cost_vec, ncopies,
>>>> +					   vec_construct, stmt_info, 0,
>>>> +					   vect_body);
>>>> +	}
>>>> +    }
>>>> 
>>>>  if (dump_enabled_p ())
>>>>    dump_printf_loc (MSG_NOTE, vect_location,
>>> 
>>> This feels like we've probably got the wrong memory_access_type.
>>> If it's a just a contiguous load then it should be VMAT_CONTIGUOUS
>>> instead.
>> 
>> I tend to agree -- that will take more surgery to the code that
>> detects strided loads in both the cost model analysis and in
>> vectorizable_load.
>> 
> It's just a little less clear how to clean this up in
> vect_analyze_data_refs.  The too-simplistic test there now is:
>
>       else if (is_a <loop_vec_info> (vinfo)
>                && TREE_CODE (DR_STEP (dr)) != INTEGER_CST)
>         {
>           if (nested_in_vect_loop_p (loop, stmt))
>             {
>               if (dump_enabled_p ())
> 		{
>                   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                                    "not vectorized: not suitable for strided "
>                                    "load ");
>                   dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM, stmt, 0);
>                 }
>               return false;
>             }
>           STMT_VINFO_STRIDED_P (stmt_info) = true;
>         }
>
> Furthermore, this is being set while processing each data reference, not after
> all information has been gathered and analyzed.  So this is too early.

Yeah, agree we can't tell at this stage.

> Next place to fix this would be in vect_analyze_slp_cost_1, where
>
>       vect_memory_access_type memory_access_type
>         = (STMT_VINFO_STRIDED_P (stmt_info)
>            ? VMAT_STRIDED_SLP
>            : VMAT_CONTIGUOUS);
>
> is too simplistic.  I guess we could do something here, but it would
> require unpacking all the grouped accesses and duplicating all the
> logic from scratch to determine whether it's a single load or not.

I think we can use:

    SLP_INSTANCE_UNROLLING_FACTOR (instance) * ncopies_for_cost

to calculate the number of times that the vector SLP statements repeat
the original scalar group.  We should then only use STRIDED_SLP if
that's greater than 1.

This affects stores as well as loads, so fixing it here will work
for both.

But I think this shows up another problem.  In the vectorised loop,
we have 1 copy of the load and 4 copies of the ABS (after unpacking).
But vect_analyze_slp_cost_1 is being called with ncopies_for_cost == 4
even for the loads.  So I think it's a double whammy: we're quadrupling
the real cost via the excessive ncopies_for_cost, and we're using the
wrong access type too.

The patch below should fix the second problem but isn't much use without
the first.  I think we need to adjust ncopies_for_cost in the recursive
calls as well.

Of course, fixing an overestimate here is likely to expose an
underestimate elsewhere. :-)

Interesting test case though.  It seems counterproductive to unroll
a loop like that before vectorisation.

Thanks,
Richard


Index: gcc/tree-vect-slp.c
===================================================================
--- gcc/tree-vect-slp.c	(revision 252986)
+++ gcc/tree-vect-slp.c	(working copy)
@@ -1687,8 +1687,10 @@
   stmt_info = vinfo_for_stmt (stmt);
   if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
     {
+      unsigned int group_ncopies = (SLP_INSTANCE_UNROLLING_FACTOR (instance)
+				   * ncopies_for_cost);
       vect_memory_access_type memory_access_type
-	= (STMT_VINFO_STRIDED_P (stmt_info)
+	= (STMT_VINFO_STRIDED_P (stmt_info) && group_ncopies > 1
 	   ? VMAT_STRIDED_SLP
 	   : VMAT_CONTIGUOUS);
       if (DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))


